ABSTRACT

From network topologies to online social networks, many of
today’s most sensitive datasets are captured in large graphs.
A significant challenge facing owners of these datasets is
how to share sensitive graphs with collaborators and authorized
users, e.g. network topologies with network equipment
vendors or Facebook’s social graphs with academic collaborators.
Current tools can provide limited node or edge privacy,
but require modifications to the graph that significantly
reduce its utility.

In this work, we propose a new alternative in the form
of graph watermarks. Graph watermarks are small graphs
tailor-made for a given graph dataset, a secure graph key,
and a secure user key. To share a sensitive graph G with a
collaborator C, the owner generates a watermark graph W
using G, the graph key, and C’s key as input, and embeds
W into G to form G'. If G' is leaked by C, its owner can
reliably determine if the watermark W generated for C does
in fact reside inside G', thereby proving C is responsible for
the leak. Graph watermarks serve both as a deterrent against
data leakage and a method of recourse after a leak. We provide
robust schemes for creating, embedding and extracting
watermarks, and use analysis and experiments on large, real
graphs to show that they are unique and difficult to forge.
We study the robustness of graph watermarks against both
single and powerful colluding attacker models, then propose
and empirically evaluate mechanisms to dramatically
improve resilience.

1. INTRODUCTION 

Many of today’s most sensitive datasets are captured in
large graphs. Such datasets can include maps of autonomous
systems in the Internet, social networks representing billions
of friendships, or connected records of patent citations.
Controlling access to these datasets is a difficult challenge. More
specifically, it is often the case that owners of large graph
datasets would like to share access to them with a fixed set
of entities without the data leaking into the public domain.
For example, an ISP may be required to share detailed network
topology graphs with a third party networking equipment
vendor, with a strict agreement that access to these sensitive
graphs must be limited to authorized personnel only.
Similarly, a large social network like Facebook or LinkedIn
may choose to share portions of its social graph data with
trusted academic collaborators, but clearly wants to prevent
their leakage into the broader research community.

One option is to focus on building strong access control
mechanisms to prevent data leakage beyond authorized parties.
Yet in most scenarios, including both examples above,
data owners cannot restrict physical access to the data, and
have limited control once the data is shared with the trusted
collaborator. It is also the case that no matter how well access
control systems are designed, they are never foolproof,
and often fall prey to attacks on the human element, i.e. social
engineering. Another option is to modify portions of the
data to reduce the impact of potential data leakages. This has
the downside of making the data inherently noisy and inaccurate,
and can still be overcome by data reconstruction or
de-anonymization attacks using external input [27]. Finally,
these schemes are hard to justify, in part because it is very
difficult to quantify the level of protection they provide.

In this work, we propose a new alternative in the form
of graph watermarks. Intuitively, watermarks are small, often
imperceptible changes to data that are difficult to remove,
and serve to associate some metadata with a particular
dataset. They are used successfully today to limit data
piracy by music vendors such as Apple and Walmart, who
embed a user’s personal information into a music file at the
time of purchase/download [3]. Should the purchased music
be leaked onto music sharing networks, it is easy for Apple
to track down the user who was responsible for the leak.
In our context, graph watermarks work in a similar way, by
securely identifying a copy of a graph with its “authorized
user.” Should a shared graph dataset be leaked and discovered
later in public domains (on BitTorrent perhaps), the data
owner can extract the watermark from the leaked copy and use
it as proof to seek damages against the collaborator responsible
for the leak. While not a panacea, graph watermarks
can provide an additional level of protection for data owners
who want to or must share their data, and perhaps encourage
risk-averse data owners to share potentially sensitive graph
data, e.g. encourage LinkedIn to share social graphs with
academic collaborators.

To be effective, a graph watermark system needs to provide
several key properties. First, graph watermarks should
be relatively small compared to the graph dataset itself. This
has two direct consequences: the watermark will be difficult
to detect (and remove) by potential attackers, and adding the
watermark to the graph has minimal impact on the graph
structure and its utility. Second, watermarks should be difficult
to forge and should not occur naturally in graphs, ensuring
that the presence of a valid watermark can be securely
associated with some user, i.e. non-repudiation. Third, both
the embedding and extraction of watermarks should be efficient,
even for extremely large graph datasets with billions
of nodes and edges. Finally, our goal is to design a watermark
system that works in any application context involving
graphs. Therefore, we make no assumptions about the
presence of metadata. Instead, our system must function for
“barebones” graphs, i.e. symmetric, unweighted graphs with
no node labels or edge weights.

In this paper, we present initial results of our efforts towards
the design of a scalable and robust graph watermark
system. Highlights of our work can be organized into the
following key contributions.

• First, we identify the goals and requirements of a graph
watermark system. We also describe an initial design of
a graph watermark system that efficiently embeds watermarks
into and extracts watermarks out of large graphs.
Graph watermarks are uniquely generated based on a user
private key, a secure graph key, and the graph they are
applied to. We describe constraints on the system's applicability,
and identify examples of graphs where watermarks cannot
achieve desirable levels of key properties such as uniqueness.

• Second, we provide a strict proof of uniqueness of graph
watermarks, showing that it is extremely difficult for attackers
to forge watermarks.

• Third, we evaluate our watermarks in terms of distortion,
false positives, and efficiency on a wide variety of large
graph datasets.

• Fourth, we identify two attack models, describe additional
features to boost robustness, and evaluate them under realistic
conditions.

To the best of our knowledge, our work is the first practical
proposal for applying watermarks to graph data. We believe
graph watermarks are a useful tool suitable for a wide
range of applications from tracking data leaks to data authentication.
Our work identifies the problem and defines an
initial groundwork, setting the stage for follow-up work to
improve robustness against a range of stronger attacks.

2. BACKGROUND AND RELATED WORK 

In this section, we provide background and related work
on the graph privacy problem and discuss the use of watermark
techniques in applications such as digital multimedia
as well as graphs.


Graph Privacy. Graph privacy is a significant problem
that has been magnified by the arrival of large graphs containing
sensitive data, e.g. Facebook social graphs or mobile
call graphs. Recent studies [?] show that deanonymization
attacks using external data can defeat most common
anonymization techniques.

A variety of solutions have been proposed, ranging from
anonymization tools that defend against specific structural
attacks to more attack-agnostic defenses. To protect node-
or edge-privacy against specific, known attacks, techniques
utilize variants of k-anonymization to produce structural
redundancy at the granularity of subgraphs, neighborhoods or
single nodes [23] [46] [?] [47]. Alternatively, randomization
provides privacy protection by randomly adding, deleting,
or switching edges [?]. Others partition the nodes and
then describe the graph at the level of partitions to avoid
structural re-identification [12]. Finally, other solutions have
taken a different approach, by producing model-driven synthetic
graphs that replicate key structural properties of the
original graph [36]. One extension of this work utilizes differential
privacy techniques to provide a tunable accuracy vs.
privacy tradeoff [?].

The goals of our work are quite different from prior work
on graph anonymization, which is meant to protect data before its
public release. We are concerned with scenarios where
graph data is shared between its owner and groups of trusted
collaborators, e.g. third party network vendors analyzing an
ISP’s network topology, or Facebook sharing a graph with
a small set of academic researchers. The ideal goal in these
scenarios is to ensure the shared data does not leak into the
wild. Once data is shared with collaborators, reliable tools
that can track leaked data back to its source serve as an excellent
deterrent. Watermarking techniques have addressed
similar problems in other contexts, and we briefly describe
them here.

Background on Digital Watermarks. Watermarking is
the process of embedding specialized metadata into multimedia
content such as images or audio/video files [14]. This
embedded watermark is later extracted from the file and used
to identify the source or owner of the content. These systems
include both an embedding component and an extraction or
recovery component. The embedding component takes three
inputs: a watermark, the original data, and a key. The watermark
is embedded into the data in a way that minimizes
impact on the data, e.g. transparent letters overlaid on top
of an image. The key is used as a parameter to change the
way the watermark is embedded, usually corresponds to a
specific user, and is kept confidential by the data owner to
prevent unauthorized parties from recovering and modifying
the watermark. Extraction takes as input the watermarked
data, the key, and possibly a copy of the original data. Extraction
can directly produce the embedded watermark or a
confidence measure of whether it is present.

Significant work has been done in digital watermarking,
particularly image watermarking [38] [24] [?] [35] [43]. Image
watermarking techniques can be classified into two classes
based on their working domains. The first class of watermarks
is applied to the original domain of the image, the
spatial domain. Basic techniques include modifying the least
significant bits of each image pixel on the original image to
encode the watermark [38] [24] [?]. The second class applies
watermarks to the transformed domain of the image, i.e. the
frequency domain. The original data is first transformed into
the frequency domain using DCT [?], DFT [35] or DWT [?].
A sequence of small noise values is then added to several
imperceptible frequencies, and the result is transformed back into
the spatial domain as the watermarked image. The sequence of noise values
is the watermark, and can be extracted by carrying out the
reverse process on the watermarked image.

Watermark techniques are already widely used today to
protect intellectual property. Watermark techniques [29] [30]
have been studied to protect digital vector maps against abuse.
Like image watermarks, these techniques can be classified as
spatial domain methods and transformed domain methods.
Unlike image watermarks, the spatial domain methods embed
watermarks by modifying vertex coordinates [29], while
the transformed domain methods tend to transform vector
maps into a different frequency domain, such as the
mesh-spectral domain [30]. Watermarks have also been used to
protect software copyrights, by adding spurious execution
paths in the code that would not be triggered by normal
inputs [?]. These execution paths are embedded as extra
control flows between blocks of code, and are triggered (or
extracted) by either locating the code, or running the program
with a special input that triggers the alternate execution
paths. Moreover, algorithms have been proposed for
watermarking relational datasets [?] [22] [13]. Much of this
work has focused on modifying numeric attributes of relations,
relying on the primary key attribute as an indicator of watermark
locations, and assuming that the primary key attribute does not
change. Finally, watermarks, in the form of minute changes,
have been applied to protect circuit designs in the semiconductor
industry [32] [42].

3. GOALS AND ATTACK MODELS 

To set the context for the design of our graph watermark 
system, we need to first clearly define the attack models we 
target, and use them to guide our design goals. 

Graph watermarks at a glance. At a high level,
we envision the graph watermark process to be simple and
lightweight, as pictured in Figure 1. Embedding a watermark
involves overlaying the original graph dataset (G) with
a small subgraph (W) generated using the original graph and
a secret random generator seed (Ω). Embedding the watermark
simply means adding or deleting edges between existing
nodes in the original graph G, based on the watermark
subgraph W. Each authorized user i receives only a
watermarked graph customized for that user, generated using a
random seed Ω_i securely associated with the user. The seed is
generated jointly from the user’s private key and a key
securely associated with the original graph.

(b) Extraction

Figure 1: Embedding and extracting graph watermarks.
Ω is a secret random generator seed produced using the
secure graph key and user’s private key.

If and when the owner detects a leaked version of the
dataset, the owner takes the leaked graph, and “extracts the
watermark,” by iteratively producing all known watermark
subgraphs W_i associated with G and each of the seeds Ω_i
associated with an authorized user. The “extraction” process
is actually a matching process where the data owner can
conclusively identify the source of the leaked data, by locating
the matching W_i in the leaked graph.

In our model of potential attackers and threats, we assume
that attackers have access to the watermarked graph, but not
the original G. Clearly, if an attacker is able to obtain the
unaltered G, then watermarks are no longer necessary or useful.

Attack Models. The attackers’ goal is to destroy or remove
graph watermarks while preserving the original graph.
Watermarks are designed to protect the overall integrity of
the graph data. Thus we do not consider scenarios where the
attackers sample the graph or distort it significantly in order
to remove the watermark. Doing so would be analogous to
removing a portion of all pixels from a watermarked video,
or applying a high pass frequency filter to watermarked music.
Under these constraints, we consider two practical attack
models below.

• Single Attacker Model. For a single attacker with access
to one watermarked graph, it will be extremely difficult to
detect the watermark subgraph. Without the key associated
with another user, forging a watermark is also impractical.
Instead, the attacker’s best option is to disrupt any potential
watermarks by modifying the graph, e.g. adding or deleting
nodes or edges.

• Collusion Attack Model. If multiple attackers join their
efforts, they can recover the original graph by comparing multiple
watermarked graphs, identifying the differences (i.e.
watermarks), and removing them.

Design Goals. These attack models help us define the key
characteristics required for an effective graph watermarking
system.

• Low distortion. The addition of watermarks should have
a small impact on the overall structure of the original graph.
This preserves the utility of the graph datasets.

• Robust to modifications. Watermarks should be robust to
modification attacks on watermarked graphs, i.e. watermarks
should remain detectable and extractable with high
probability, even after the graph has been modified by an
attacker.

• Low false positives. It should be extremely unlikely for our
system to identify a valid watermark W_i in an unwatermarked
graph or in a graph watermarked with W_j where
i ≠ j. When we embed a single watermark (Section 4), we
also refer to this property as watermark uniqueness.

Within the constraints defined above, designing a graph
watermark system is quite challenging, for several reasons.
First, the subgraph that represents the watermark must be
relatively “unique,” i.e. it is highly unlikely to occur naturally,
or intentionally through forgery. A second, contrasting
goal is that the watermark should not change the underlying
graph significantly (low distortion), or be easily detected.
Walking the fine line between this and properties of “uniqueness”
likely means we have to restrict the set of graphs which
can be watermarked, i.e. for some graphs, it will be impossible
to find a hard to detect watermark that does not occur
easily in graphs. Finally, since any leaked graph can have all
metadata stripped or modified, watermark embedding and
extraction algorithms must function without any labels or
identifiers. Note that the problem of subgraph matching is
known to be NP-complete [8].

4. BASIC WATERMARK DESIGN 

We now describe the basic design of our graph watermarking
system. The basic design seeks to embed and extract
watermarks on graphs to achieve watermark uniqueness while
minimizing distortion on graph structure. Our design has
two key components:

• Watermark embedding: The data owner holds a graph
key K_G associated with a graph G and known only to her.
Each user i generates its public-private cryptographic key
pair <K^i_pub, K^i_priv> through a standard public-key
algorithm [25], where K^i_pub is user i’s public key and K^i_priv is
its corresponding private key. To share the graph G with
user i, the system combines user i’s digital signature
K^i_priv(T) over a timestamp T and the graph key K_G to form a
random generator seed Ω_i, and uses Ω_i to generate a watermark graph
W_i for graph G. The system embeds W_i into G by selecting
and modifying a subgraph of G that contains the same
number of nodes as W_i. The resulting graph G^{W_i} is given
to user i as the watermarked graph.

• Watermark extraction: To identify the watermark in G',
for each user i we use Ω_i to regenerate W_i and then search
for the existence of W_i within G'.

In this section, we focus on describing the detailed procedure
of these two components. We present detailed analysis
on the two fundamental properties of graph watermarks, i.e.
uniqueness and detectability, in Section 5.

4.1 Watermark Embedding 

The most straightforward way to embed a watermark is
to directly attach the watermark graph to the original graph.
That is, if W_i represents the watermark graph for user i, and
G represents the original graph to be watermarked, the embedding
treats W_i as an independent graph, and adds new
edges to connect W_i to G. However, this approach has
two disadvantages. First, direct graph attachment makes
it easy for external attackers to identify and remove W_i
from G without using graph key K_G and user i’s signature
K^i_priv(T). New edges connecting W_i and G must be carefully
chosen to reduce the chance of detection, and this is a
very challenging task. Second, attaching a (structurally different)
subgraph W_i directly to a graph G introduces larger
structural distortions.

Instead, we propose an alternative approach that embeds
the watermark graph “in-band.” That is, the embedding process
first selects k nodes (k is the number of nodes in W_i)
from G and identifies S, the corresponding subgraph of G
induced by these k nodes. It then modifies S using W_i without
affecting any other nodes in G. Because the watermark
graph W_i is naturally connected with the rest of the graph,
both the risk of detection and amount of distortion induced
on the original graph G are significantly lower than those of
the direct attachment approach.

We now describe the details of “in-band” watermark embedding,
which consists of four steps: (1) generating the random
generator seed Ω_i from user i’s signature K^i_priv(T) and
graph key K_G; (2) generating the watermark graph W_i from
the seed Ω_i; (3) selecting the placement of W_i on G by picking
k nodes from G and identifying the corresponding subgraph
S induced by these k nodes; and (4) embedding W_i
into G by modifying S to match the structure of W_i.

Step 1: Generating random generator seed Ω_i. To generate
an unforgeable watermarked graph, we generate a random
generator seed Ω_i [9] using user i’s signature K^i_priv(T)
and graph key K_G.

Suppose the system intends to generate a watermarked
version of graph G at time T to share with a specific user
i. We begin by sending user i the current timestamp
T. User i responds with its signature K^i_priv(T), produced by
encrypting the timestamp with its private key K^i_priv. Before
proceeding further, we validate the result K^i_priv(T) to ensure
it is from i, by decrypting it with user i’s public key K^i_pub. If
the timestamps match, we combine the signature K^i_priv(T)
and the graph key K_G to form the seed of the random graph
generator for user i, Ω_i. A mismatch may indicate that user
i is potentially malicious.

Note that Ω_i cannot be formed by the data owner alone,
who only holds the graph key K_G, or by user i alone, who only
owns its private key K^i_priv. Therefore, results computed using
seed Ω_i, including the random graph W_i generated (Step
2) and the choice of graph nodes to mark (Step 3), cannot
be derived independently by the data owner or identified by
user i.
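As a rough illustration of Step 1, the sketch below derives a seed from the graph key and a user's signature. It rests on assumptions the text does not state: the signature is treated as raw bytes that have already been verified against the user's public key, and HMAC-SHA256 stands in for the unspecified way the two secrets are "combined." The function name derive_seed and the sample byte strings are hypothetical.

```python
import hashlib
import hmac


def derive_seed(graph_key: bytes, signature: bytes) -> int:
    """Combine the graph key and a user's signature over T into an integer seed.

    HMAC-SHA256 is an assumed combiner; the paper does not name one.
    """
    digest = hmac.new(graph_key, signature, hashlib.sha256).digest()
    return int.from_bytes(digest, "big")


# Hypothetical inputs: in practice the signature comes from user i and is
# verified against the user's public key before this call.
seed_a = derive_seed(b"graph-key-of-owner", b"signature-of-user-a-over-T")
seed_b = derive_seed(b"graph-key-of-owner", b"signature-of-user-b-over-T")
```

With this construction neither secret alone determines the seed, which mirrors the property described above: the owner needs the user's signature and the user needs the graph key.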

Step 2: Generating the watermark graph W_i. We generate
W_i as a random graph with edge probability p and
node count k (k << n, where n is the number of nodes in
G). The random edge generator uses Ω_i as the seed. The
k nodes of W_i are ordered as {v_1, v_2, ..., v_k}.

The key factor in this step is choosing the node count k
and the edge probability p. As we will show in Section 5.1,
the two parameters must satisfy the following requirement
to ensure watermark uniqueness:

k > (2 + δ) log_q n    (1)

where q = 1 / max(p, 1 − p) and δ is a constant > 0. Furthermore,
it is easy to prove that p = 1/2 minimizes the node count k and
the average edge count p · (k choose 2) of the watermark graph W_i.
Intuitively, using a compact watermark graph not only reduces
the amount of distortion to G, but also improves its robustness
against malicious attacks. Therefore, we configure
p = 1/2 and therefore k = (2 + δ) log_2 n. This produces a
reasonably sized watermark graph (k < 100) even for extremely
large graphs, e.g. the complete Facebook social graph (~1
billion nodes in 2014).
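To make the size bound concrete, the following sketch computes the smallest integer k above (2 + δ) log_2 n and generates a seeded random graph with edge probability 1/2. The value δ = 0.5, the seed 12345, and the helper names are illustrative assumptions; Python's random.Random is used only for reproducibility and is not a cryptographically secure generator.

```python
import math
import random
from itertools import combinations


def watermark_size(n: int, delta: float) -> int:
    """Smallest integer k with k > (2 + delta) * log2(n), i.e. the p = 1/2 case."""
    return math.floor((2 + delta) * math.log2(n)) + 1


def random_watermark(seed: int, k: int, p: float = 0.5) -> set:
    """Seeded random graph on nodes 0..k-1, as a set of (i, j) pairs with i < j."""
    rng = random.Random(seed)
    # combinations() yields pairs in a fixed order, so the result depends only on the seed.
    return {(i, j) for i, j in combinations(range(k), 2) if rng.random() < p}


k = watermark_size(10**9, delta=0.5)  # delta = 0.5 is an illustrative choice
W = random_watermark(seed=12345, k=k)
```

For n = 10^9 and δ = 0.5, log_2 n is about 29.9, so k = 75, which is consistent with the k < 100 figure quoted above.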

Step 3: Selecting the watermark placement on graph G.
Next, we identify k nodes from G and their corresponding subgraph
S to embed the watermark graph. To ensure reliable
extraction, we must choose these k nodes carefully, meeting
these two requirements. First, using Ω_i generated in Step
1, the k nodes must be chosen deterministically and remain
distinguishable from the other nodes of G. Second, the sets
of k nodes chosen for different watermarks (or different
Ω_i values) must be easily distinguishable from each other to
reinforce watermark uniqueness. Our biggest challenge in
meeting these requirements is that we cannot use node IDs to
distinguish nodes from each other. Node IDs or any type of
metadata can be easily altered or stripped by attackers before
or after leaking G', thereby making extraction impossible.

We address this challenge by using local graph structure
around each node as its “label.” Specifically, we define a
node structure description (NSD) as a distinguishable feature
of each node. A node v’s NSD is represented by an
array of v’s sorted neighbor degrees. For example, if node v
has three neighbors with node degrees 2, 6, 4, respectively,
then v’s NSD label is “2-4-6.” We then hash v’s NSD label
into a numerical value using a secure one-way hash, e.g.
SHA-1 [34], and refer to the result as node v’s NSDhash.
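The NSD label and its hash can be sketched in a few lines. The adjacency-dictionary representation, the helper names, and the toy graph below are assumptions made for illustration; the toy graph is built so that node "v" has neighbors of degree 2, 6 and 4, matching the example in the text.

```python
import hashlib


def build_adj(edges):
    """Undirected adjacency sets from a list of (u, w) pairs."""
    adj = {}
    for u, w in edges:
        adj.setdefault(u, set()).add(w)
        adj.setdefault(w, set()).add(u)
    return adj


def nsd_label(adj: dict, v) -> str:
    """Sorted degrees of v's neighbors joined by '-', e.g. '2-4-6'."""
    return "-".join(str(d) for d in sorted(len(adj[u]) for u in adj[v]))


def nsd_hash(adj: dict, v) -> str:
    """SHA-1 hex digest of v's NSD label."""
    return hashlib.sha1(nsd_label(adj, v).encode("ascii")).hexdigest()


edges = [("v", "a"), ("v", "b"), ("v", "c"),
         ("a", "p1"),
         ("c", "p1"), ("c", "p2"), ("c", "p3"),
         ("b", "p1"), ("b", "p2"), ("b", "p3"), ("b", "p4"), ("b", "p5")]
adj = build_adj(edges)
label = nsd_label(adj, "v")  # "2-4-6"
```

Because the label depends only on degrees, it survives any relabeling of node IDs, which is the property the placement step relies on.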

Next, we use Ω_i as the seed to randomly generate k hash
values, and use each as an index (e.g. using a mod function)
to identify a node in G. It is possible that multiple nodes
have the same NSDhash, i.e. a collision. If this happens,
we resolve the collision by using Ω_i again as an index into a
sorted list of the nodes with the same NSDhash. The nodes
can be sorted by any deterministic order, e.g. node IDs in
the original graph. Note that this process is only required for
embedding (and not extraction), so any deterministic order
chosen by the graph owner will suffice.

At the end of this step, we obtain k ordered nodes from
G, X = {x_1, x_2, ..., x_k}, and the corresponding subgraph
S = G[X] induced by the node set X on G.
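One possible reading of the node selection procedure is sketched below. The text leaves several details open, so the following are assumptions: each seeded draw is reduced with a mod operation to pick one distinct NSDhash value, a second seeded draw picks a node among those sharing that hash, and a repeated node triggers a redraw so that k distinct nodes result. The function select_nodes and the sample hashes are hypothetical.

```python
import random


def select_nodes(nsd_hashes: dict, seed: int, k: int) -> list:
    """Pick k distinct nodes deterministically from a seed.

    nsd_hashes maps node ID -> NSDhash; node IDs must be sortable.
    """
    if k > len(nsd_hashes):
        raise ValueError("k exceeds the number of nodes")
    buckets = {}
    for node, h in nsd_hashes.items():
        buckets.setdefault(h, []).append(node)
    keys = sorted(buckets)        # deterministic order of distinct hash values
    for h in keys:
        buckets[h].sort()         # deterministic order among colliding nodes
    rng = random.Random(seed)
    chosen = []
    while len(chosen) < k:
        bucket = buckets[keys[rng.getrandbits(64) % len(keys)]]
        node = bucket[rng.getrandbits(64) % len(bucket)]
        if node not in chosen:    # assumption: redraw when a node repeats
            chosen.append(node)
    return chosen


hashes = {1: "aa", 2: "bb", 3: "aa", 4: "cc", 5: "bb"}
X = select_nodes(hashes, seed=7, k=3)
```

The same seed always yields the same ordered list X, which is what allows the owner to regenerate the placement during extraction.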

Step 4: Embedding the watermark graph W_i into graph G.
In this step, we embed the watermark graph W_i by
modifying the subgraph S = G[X] to match W_i. Specifically,
we match each (ranked) node in W_i, {v_1, v_2, ..., v_k},
with the corresponding node in S (or X), {x_1, x_2, ..., x_k},
i.e. f : W_i → S, f(v_i) = x_i. Once the nodes are
mapped, we apply an XOR operation on each edge of
the two graphs. That is, we consider the connection between
(v_i, v_j) or (x_i, x_j) as one bit, i.e. an edge between (v_i, v_j)
or (x_i, x_j) means 1 and no edge between (v_i, v_j) or (x_i, x_j)
means 0. If an edge (v_i, v_j) exists in W_i, we modify the
corresponding edge value in S from (x_i, x_j) to (x_i, x_j) ⊕ 1;
and if no edge (v_i, v_j) exists in W_i, we modify the edge
value (x_i, x_j) to (x_i, x_j) ⊕ 0. When the above edge modification
process ends, we also explicitly create edges between
nodes x_i and x_{i+1} to maintain a connected subgraph. As
a result, we transform the subgraph S into S^{W_i} with the
watermark graph W_i embedded. The reason for choosing the
XOR operation is that it allows the same watermark to be
embedded in the graph multiple times (at multiple locations),
thus reducing the risk of the watermark being detected and
destroyed by attacks such as frequent subgraph mining. We
discuss this in more detail later in the paper.
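The XOR embedding can be sketched on plain edge sets. The representation (undirected edges as frozensets, watermark edges as 0-based index pairs) and the function name embed_watermark are assumptions for illustration.

```python
def embed_watermark(graph_edges: set, X: list, W: set) -> set:
    """Return the edge set of G after XOR-embedding W on the ordered nodes X.

    graph_edges: undirected edges of G, each a frozenset({u, v}).
    X: the k ordered nodes x_1..x_k of G (stored 0-based).
    W: watermark edges as (i, j) index pairs with i < j over 0..k-1.
    """
    out = set(graph_edges)
    for i, j in W:
        # XOR with 1 flips the edge; pairs absent from W (XOR with 0) are untouched.
        out ^= {frozenset((X[i], X[j]))}
    for i in range(len(X) - 1):
        # Explicit chain x_i -- x_{i+1} keeps the marked subgraph connected.
        out.add(frozenset((X[i], X[i + 1])))
    return out


G = {frozenset(e) for e in [("a", "b"), ("b", "c"), ("c", "d")]}
marked = embed_watermark(G, X=["a", "b", "c"], W={(0, 1), (0, 2)})
```

In this toy case the XOR removes a-b and adds a-c, and the chain step then restores a-b, so only a-c is new. Since the chain step can override XOR results on consecutive pairs, extraction compares against the regenerated subgraph rather than inverting the XOR.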

At the end of this step, we obtain a watermarked graph
G^{W_i} for user i. Before we distribute it to user i, we
anonymize G^{W_i} by completely (randomly) reassigning all
node IDs. Such anonymization not only helps to protect
user privacy, but also minimizes the opportunity for colluding
attackers with multiple watermarked graphs to identify
the embedded watermark (discussed later in the paper).

4.2 Watermark Extraction 

The watermark extraction process determines if a watermark
graph W_i is embedded in a target graph G'. If so, then
G' is a legitimate copy distributed to user i. The extraction
process faces two key challenges. First, the target graph G',
likely a watermarked version of the original graph G, can
easily be modified by users/attackers during the graph distribution
process. In particular, all node IDs can be very different
from those of the original G. Thus extraction cannot rely
on node IDs in G'. Second, identifying whether a subgraph
exists in a large graph is equivalent to a subgraph matching
problem, known to be NP-complete. To handle massive
graphs, we need a computationally efficient algorithm.

Our design addresses these two challenges by leveraging
knowledge on the structure of the subgraph where the watermark
was embedded. This eliminates the dependency on
node IDs while significantly reducing the search space during
the subgraph matching process. We describe our proposed
design in detail below.

Step 1: Regenerating the watermark. The owner performs
the extraction, and has access to the original graph G,
graph key K_G, and each user’s signature K^i_priv(T). For each user
i, we combine its signature K^i_priv(T) and graph key K_G to
generate the random generator seed Ω_i for that user. Then,
we follow Steps 2 to 4 described in Section 4.1 to regenerate
the watermark graph W_i, identify the k ordered nodes from
G and their NSD labels, and finally the modified subgraph
S^{W_i} that was placed on a “clean” version of the watermarked
graph G^{W_i}.

Step 2: Identifying candidate watermark nodes on G'.
Given the k nodes X = {x_1, x_2, ..., x_k} identified from the
original graph G, in this step we need to identify, for each
x_j, a set of candidate nodes on the target graph G' that can
potentially be x_j. We accomplish this by identifying
all the nodes on G' whose NSD labels are the same as that of x_j in
the “clean” version of the watermarked graph G^{W_i}. Since
multiple nodes can have the same NSD label, this process
will very likely produce multiple candidates. To shrink the
candidate list, we examine the connectivity between candidate
nodes of X on G' and compare it to that among X on
G^{W_i}. If two nodes x_m and x_n are connected in G^{W_i}, we
prune their candidate node lists by removing any candidate
node of x_m that has no edge with any candidate node of x_n
on G' and vice versa. This pruning process dramatically reduces
the search space. After this step, we obtain for each
x_i the candidate node list C_i on the target graph G'.
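The pruning rule can be sketched as follows. The data layout and the name prune_candidates are assumptions, and repeating the pass until no list changes is an added assumption as well; the text describes the rule itself, not how many passes are made.

```python
def prune_candidates(adj: dict, cands: list, sw_edges: set) -> list:
    """Drop candidates that cannot satisfy an edge required by the marked subgraph.

    adj: adjacency sets of the target graph G'.
    cands: cands[i] is the set of G' nodes whose NSD label matches x_i.
    sw_edges: (m, n) index pairs that are connected in the marked subgraph.
    """
    cands = [set(c) for c in cands]
    changed = True
    while changed:                      # assumption: iterate to a fixed point
        changed = False
        for m, n in sw_edges:
            for a, b in ((m, n), (n, m)):
                # Keep a candidate of x_a only if it neighbors some candidate of x_b.
                keep = {c for c in cands[a] if adj.get(c, set()) & cands[b]}
                if keep != cands[a]:
                    cands[a] = keep
                    changed = True
    return cands


adj = {1: {2}, 2: {1, 3}, 3: {2}, 4: set()}
pruned = prune_candidates(adj, cands=[{1, 4}, {2}], sw_edges={(0, 1)})
```

Here node 4 is dropped as a candidate for x_1 because it has no edge to any candidate of x_2.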

Step 3: Detecting watermark graph S^{W_i} on G'. Given
the candidate node list of each node in X, we now search
for the existence of S^{W_i} on the target graph G'. For this
we apply a recursive algorithm to enumerate and prune the
combinations of the candidate sets, until we identify S^{W_i}
or exhaust all the node candidates. The detailed algorithm
is listed in Algorithm 1. In this algorithm, we use a node
list Y to record the list of nodes in G' which we have
already finalized as the corresponding nodes in S^{W_i}, i.e.
Y = {y_1, y_2, ..., y_m} (m < k). When the process starts,
Y = ∅, m = 0.

Discussion. The above design shows that our watermark
extraction algorithm simplifies the subgraph search problem
by restricting it to a small number of selected nodes from
a graph, thus avoiding the NP-complete subgraph matching
problem. Also note that we target real graphs with very high
levels of node heterogeneity, e.g. small-world, power-law or
highly clustered graphs, which are very far from the uniform,
lattice-like graphs that are the worst case scenarios for graph
isomorphism. In practice, our system can efficiently extract
watermarks from real, million-node graphs, and do so in a
few minutes on a single commodity server (Section 7.3).


Algorithm 1 Recursive Algorithm for Detecting $S^{W_i}$ on $G'$.

Function: SubgraphDetection($G'$, $S^{W_i}$, $\{C_1, C_2, \ldots, C_k\}$, $Y$, $m$)
Input: Graph $G'$, watermark graph $S^{W_i}$, candidate node list $C_i$ for each node $x_i$ in $X$, identified node list $Y = \{y_1, y_2, \ldots, y_m\}$ ($m < k$)
Output: Identified node list $Y$

for each node $c \in C_{m+1}$ do
    if $c \notin Y$ and each edge $(c, y_t)$ in $G'$ ($t = 1..m$) is the same as the edge $(x_{m+1}, x_t)$ in $S^{W_i}$ ($t = 1..m$) then
        $Y = Y \cup c$
        $m = m + 1$
        if $m == k$ then
            Return $Y$
        else
            SubgraphDetection($G'$, $S^{W_i}$, $\{C_1, C_2, \ldots, C_k\}$, $Y$, $m$)
        end if
        $Y = Y \setminus c$
        $m = m - 1$
    end if
end for
Return $Y$
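The recursive search of Algorithm 1 can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the authors' implementation: the edge-set representation (a set of `frozenset` pairs), the function names, and the choice to return `None` when no match exists are all hypothetical. Unlike the listing above, the sketch propagates a successful match up the recursion explicitly.

```python
from typing import FrozenSet, Hashable, List, Optional, Sequence, Set

# Assumption: an undirected graph is stored as a set of frozenset({u, v}) edges.
EdgeSet = Set[FrozenSet[Hashable]]


def has_edge(edges: EdgeSet, u: Hashable, v: Hashable) -> bool:
    """True if the undirected edge (u, v) is present."""
    return frozenset((u, v)) in edges


def subgraph_detection(
    target_edges: EdgeSet,                      # edges of G'
    watermark_edges: EdgeSet,                   # edges of S^{W_i}
    watermark_nodes: Sequence[Hashable],        # ordered nodes x_1..x_k
    candidates: Sequence[Sequence[Hashable]],   # C_1..C_k, one list per x_i
    matched: Optional[List[Hashable]] = None,   # Y, the nodes fixed so far
) -> Optional[List[Hashable]]:
    """Backtracking search for nodes y_1..y_k in G' that reproduce S^{W_i}."""
    matched = [] if matched is None else matched
    m = len(matched)
    if m == len(watermark_nodes):
        return list(matched)
    for c in candidates[m]:
        if c in matched:
            continue
        # Every pair (c, y_t) must agree with (x_{m+1}, x_t) in the watermark.
        consistent = all(
            has_edge(target_edges, c, matched[t])
            == has_edge(watermark_edges, watermark_nodes[m], watermark_nodes[t])
            for t in range(m)
        )
        if consistent:
            matched.append(c)
            found = subgraph_detection(
                target_edges, watermark_edges, watermark_nodes, candidates, matched
            )
            if found is not None:
                return found
            matched.pop()  # backtrack: undo the choice of c
    return None
```

Because each candidate is checked against all previously fixed nodes before recursing, inconsistent branches are pruned early, which is the property the extraction step relies on.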





5. FUNDAMENTAL PROPERTIES 

Having described the basic watermark system, we now present a detailed analysis of its two fundamental properties: watermark uniqueness, where each watermark must be unique to the corresponding user, and watermark detectability, where the presence of a watermark should not be easily detectable by external users without knowledge of the seed $\Omega_i$ associated with user $i$.

5.1 Watermark Uniqueness 

As a proof of ownership, each embedded watermark should be unique for its user. That is, given the original graph $G$ and the seed $\Omega_i$ associated with user $i$, the embedded watermark graph $S^{W_i}$ should not be isomorphic to any subgraph of $G^{W_j}$ ($i \neq j$), where $G^{W_j}$ is the watermarked graph for user $j$. At the same time, $S^{W_i}$ should not be isomorphic to any subgraph of the original graph $G$. In the following, we show that with high probability, our proposed graph watermark system produces unique watermarks for any graph $G$.

THEOREM 1. Given a graph $G$ with $n$ nodes, let $k > (2 + \delta) \log_2 n$ for a positive constant $\delta > 0$. We apply the following process to create a watermarked graph $G^{W_i}$ for user $i$:

• We create $k$ nodes, $V = \{v_1, v_2, \ldots, v_k\}$, and generate a random graph $W_i$ on $V$ with an edge probability of $1/2$.

• We randomly select $k$ nodes, $X = \{x_1, x_2, \ldots, x_k\}$, from $G$, and identify the subgraph corresponding to these $k$ nodes, $S = G[X]$.

• Using $W_i$, we modify $S$ as follows: we first map each node $x_i$ in $X$ to a node $v_i$ in $V$. Let $e(u, v) = 1$ denote that an edge exists between nodes $u$ and $v$, and $e(u, v) = 0$ otherwise. We modify each $e(x_i, x_j)$ in $S$ to $e(x_i, x_j) \oplus e(v_i, v_j)$. We then explicitly connect nodes $x_i$ and $x_{i+1}$, i.e. $e(x_i, x_{i+1}) = 1$. The resulting $S$ now becomes $S^{W_i}$,
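Table 1: Suitability of watermarking for 48 of today's network graphs, determined by comparing their node degree distribution $[N_{min}(G), N_{max}(G)]$ and $k$-node subgraph density $[D_{min}(k), D_{max}(k)]$ to those of the embedded watermark graphs. 35 out of these 48 graphs are suitable for watermarking.

| Category | Graph | # of Nodes | # of Edges | Avg. Deg. | k | Node degree criterion: (k+1)/2 | Node degree criterion: [N_min(G), N_max(G)] | Subgraph density criterion: watermark | Subgraph density criterion: [D_min(k), D_max(k)] | Suitability |
|---|---|---|---|---|---|---|---|---|---|---|
| Facebook | Russia | 97,134 | 289,324 | 6.0 | 39 | 20 | [1, 748] | 390 | [45, 701] | Yes |
| Facebook | L.A. | 603,834 | 7,676,486 | 25.4 | 45 | 23 | [1, 2141] | 517 | [44, 975] | Yes |
| Facebook | London | 1,690,053 | 23,084,859 | 27.3 | 48 | 24 | [1, 1483] | 588 | [47, 1128] | Yes |
| Other Social Networks | Epinions (1) | 75,879 | 405,740 | 10.7 | 38 | 19 | [1, 3044] | 370 | [47, 649] | Yes |
| Other Social Networks | Slashdot (08/11/06) | 77,360 | 507,833 | 13.1 | 38 | 19 | [1, 2540] | 370 | [38, 668] | Yes |
| Other Social Networks | Twitter | 81,306 | 1,342,303 | 33.0 | 38 | 19 | [1, 3383] | 370 | [44, 703] | Yes |
| Other Social Networks | Slashdot (09/02/16) | 81,867 | 497,672 | 12.2 | 38 | 19 | [1, 2546] | 370 | [38, 669] | Yes |
| Other Social Networks | Slashdot (09/02/21) | 82,140 | 500,481 | 12.2 | 38 | 19 | [1, 2548] | 370 | [38, 669] | Yes |
| Other Social Networks | Slashdot (09/02/22) | 82,168 | 543,381 | 13.2 | 38 | 19 | [1, 2553] | 370 | [38, 673] | Yes |
| Other Social Networks | GPlus | 107,614 | 12,238,285 | 227.5 | 39 | 20 | [1, 20127] | 389.5 | [53, 741] | Yes |
| Other Social Networks | Epinions (2) | 131,828 | 711,496 | 10.8 | 40 | 20 | [1, 3558] | 409.5 | [51, 780] | Yes |
| Other Social Networks | Youtube | 1,134,890 | 2,987,624 | 5.3 | 47 | 24 | [1, 28754] | 563.5 | [47, 815] | Yes |
| Other Social Networks | Pokec | 1,632,803 | 22,301,964 | 27.3 | 48 | 24 | [1, 14854] | 587.5 | [47, 979] | Yes |
| Other Social Networks | Flickr | 1,715,255 | 15,555,041 | 18.1 | 48 | 24 | [1, 27236] | 588 | [51, 1128] | Yes |
| Other Social Networks | Livejournal | 5,204,176 | 48,942,196 | 18.8 | 52 | 26 | [1, 15017] | 689 | [51, 1326] | Yes |
| Citation Networks | Patents | 23,133 | 93,468 | 8.1 | 34 | 17 | [1, 280] | 297 | [37, 373] | Yes |
| Citation Networks | ArXiv (Theo. Cit.) | 27,770 | 352,304 | 25.4 | 34 | 17 | [1, 2468] | 297 | [36, 534] | Yes |
| Citation Networks | ArXiv (Phy. Cit.) | 34,546 | 420,899 | 24.4 | 35 | 18 | [1, 846] | 314.5 | [36, 544] | Yes |
| Collaboration Networks | ArXiv (Phy.) | 12,008 | 118,505 | 19.7 | 32 | 16 | [1, 491] | 263.5 | [45, 496] | Yes |
| Collaboration Networks | ArXiv (Astro) | 18,772 | 198,080 | 21.1 | 33 | 17 | [1, 504] | 280 | [37, 528] | Yes |
| Collaboration Networks | DBLP | 317,080 | 1,049,866 | 6.6 | 43 | 22 | [1, 343] | 472.5 | [43, 903] | Yes |
| Collaboration Networks | ArXiv (Condense) | 3,774,768 | 16,518,947 | 8.8 | 51 | 26 | [1, 793] | 663 | [50, 1063] | Yes |
| Communication Networks | Email (Enron) | 36,692 | 183,831 | 10.0 | 35 | 18 | [1, 1383] | 314.5 | [43, 515] | Yes |
| Communication Networks | Email (Europe) | 265,214 | 365,025 | 2.8 | 42 | 21 | [1, 7636] | 451 | [74, 683] | Yes |
| Communication Networks | Wiki | 2,394,385 | 4,659,565 | 3.9 | 49 | 25 | [1, 100029] | 612 | [65, 1066] | Yes |
| Web graphs | Stanford | 281,903 | 1,992,636 | 14.1 | 42 | 21 | [1, 38625] | 451 | [66, 861] | Yes |
| Web graphs | NotreDame | 325,729 | 1,103,835 | 6.8 | 43 | 22 | [1, 10721] | 472.5 | [60, 903] | Yes |
| Web graphs | BerkStan | 685,230 | 6,649,470 | 19.4 | 45 | 23 | [1, 84230] | 517 | [79, 990] | Yes |
| Web graphs | Google | 875,713 | 4,322,051 | 9.9 | 46 | 23 | [1, 6332] | 540 | [72, 1033] | Yes |
| Location based OSNs | Brightkite | 58,228 | 214,078 | 7.4 | 37 | 19 | [1, 1134] | 351 | [41, 665] | Yes |
| Location based OSNs | Gowalla | 196,591 | 950,327 | 9.7 | 41 | 21 | [1, 14730] | 430 | [44, 723] | Yes |
| AS Graphs | Oregon (1) | 11,174 | 23,409 | 4.2 | 31 | 16 | [1, 2389] | 247.5 | [95, 352] | Yes |
| AS Graphs | Oregon (2) | 11,461 | 32,730 | 5.7 | 32 | 16 | [1, 2432] | 263.5 | [79, 476] | Yes |
| AS Graphs | CAIDA | 26,475 | 53,381 | 4.0 | 34 | 17 | [1, 2628] | 297 | [113, 436] | Yes |
| AS Graphs | Skitter | 1,696,415 | 11,095,298 | 13.1 | 48 | 24 | [1, 35455] | 588 | [52, 1128] | Yes |
| P2P networks | Gnutella (02/08/04) | 10,876 | 39,994 | 7.4 | 31 | 16 | [1, 103] | 247.5 | [30, 80] | No |
| P2P networks | Gnutella (02/08/25) | 22,687 | 54,705 | 4.8 | 34 | 17 | [1, 66] | 297 | [0, 0] | No |
| P2P networks | Gnutella (02/08/24) | 26,518 | 65,369 | 4.9 | 34 | 17 | [1, 355] | 297 | [0, 44] | No |
| P2P networks | Gnutella (02/08/30) | 36,682 | 88,328 | 4.8 | 35 | 18 | [1, 55] | 314.5 | [35, 70] | No |
| P2P networks | Gnutella (02/08/31) | 62,586 | 147,892 | 4.7 | 37 | 19 | [1, 95] | 351 | [39, 76] | No |
| Amazon Co-purchasing Networks | Amazon (03/03/02) | 262,111 | 899,792 | 6.9 | 42 | 21 | [1, 420] | 451 | [88, 132] | No |
| Amazon Co-purchasing Networks | Amazon (2012) | 334,863 | 925,872 | 5.5 | 43 | 22 | [1, 549] | 472.5 | [0, 0] | No |
| Amazon Co-purchasing Networks | Amazon (03/03/12) | 400,727 | 2,349,869 | 11.7 | 43 | 22 | [1, 2747] | 472.5 | [52, 285] | No |
| Amazon Co-purchasing Networks | Amazon (03/06/01) | 403,394 | 2,443,408 | 12.1 | 43 | 22 | [1, 2752] | 473 | [52, 333] | No |
| Amazon Co-purchasing Networks | Amazon (03/05/05) | 410,236 | 2,439,437 | 11.9 | 43 | 22 | [1, 2760] | 472.5 | [50, 333] | No |
| Road Networks | Pennsylvania | 1,088,092 | 1,541,898 | 2.8 | 47 | 24 | [1, 9] | 563.5 | [0, 0] | No |
| Road Networks | Texas | 1,379,917 | 1,921,660 | 2.8 | 47 | 24 | [1, 12] | 563.5 | [0, 0] | No |
| Road Networks | California | 1,965,206 | 2,766,607 | 2.8 | 49 | 25 | [1, 12] | 612 | [0, 0] | No |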

and the resulting $G$ becomes $G^{W_i}$.

Let $G^{W_l}$ denote a watermarked graph for user $l$ ($l \neq i$), built using a different seed $\Omega_l$. Then with low probability, any subgraph of $G^{W_l}$ or $G$ is isomorphic to $S^{W_i}$.
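The construction in Theorem 1 can be sketched in Python as follows. This is a minimal illustration, assuming an undirected graph stored as a set of `frozenset` edges and a seeded `random.Random` generator standing in for the seed $\Omega_i$; the function name and data layout are hypothetical, and the real system selects and orders nodes with additional machinery not shown here. Since the consecutive pairs $(x_i, x_{i+1})$ are forced to 1 afterwards, the sketch skips the XOR on them, which yields the same final subgraph.

```python
import random
from itertools import combinations


def embed_watermark(nodes, edges, k, seed):
    """Return (X, watermarked edge set) following the three steps of Theorem 1.

    nodes: iterable of sortable node ids of G
    edges: set of frozenset({u, v}) undirected edges of G
    seed:  stands in for the user seed (hypothetical representation)
    """
    rng = random.Random(seed)
    # Step 2: pick k nodes X = {x_1..x_k} from G (ordered as sampled).
    selected = rng.sample(sorted(nodes), k)
    new_edges = set(edges)
    # Steps 1 and 3: draw each edge of the random graph W_i with
    # probability 1/2 and XOR it into S = G[X].
    for i, j in combinations(range(k), 2):
        if j == i + 1:
            continue  # consecutive pairs are overwritten below
        if rng.random() < 0.5:  # e(v_i, v_j) = 1 in W_i
            new_edges ^= {frozenset((selected[i], selected[j]))}
    # Explicitly connect x_i and x_{i+1}.
    for i in range(k - 1):
        new_edges.add(frozenset((selected[i], selected[i + 1])))
    return selected, new_edges
```

Only node pairs inside $X$ are ever touched, so the rest of $G$ is left unchanged, and rerunning with the same seed reproduces the same watermark.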

Proof. We first show that with low probability, any subgraph of $G^{W_l}$ is isomorphic to $S^{W_i}$. Let $Y = \{y_1, y_2, \ldots, y_k\}$ be a set of ordered nodes in $G^{W_l}$, where each $y_j$ maps to a node $x_j$ in $X$. We say that an event $\mathcal{E}_Y$ occurs if the subgraph $G^{W_l}[Y]$ is isomorphic to $G^{W_i}[X]$, i.e. to $S^{W_i}$. Then the event $\mathcal{E}$, representing the fact that there exists at least one subgraph on $G^{W_l}$ that is isomorphic to $S^{W_i}$, is the union of the events $\mathcal{E}_Y$ over all possible $Y$, i.e. $\mathcal{E} = \bigcup_Y \mathcal{E}_Y$.

Next, we compute the probability of event $\mathcal{E}$ from those of the individual events $\mathcal{E}_Y$. Specifically, we first show that the probability that an edge exists between nodes $x_i$ and $x_j$ ($j \neq i + 1$) in $S^{W_i} = G^{W_i}[X]$ is $1/2$. This is because each edge in the random graph $W_i$ is independently generated with probability $1/2$. After performing the XOR operation between $W_i$ and $S$, the probability that an edge exists between $x_i$ and $x_j$ ($j \neq i + 1$) on $S^{W_i}$ is $\frac{1}{2} \cdot p_{ij} + (1 - p_{ij}) \cdot \frac{1}{2} = \frac{1}{2}$, where $p_{ij}$ is the probability that an edge exists between $x_i$ and $x_j$ on $S$. Thus the result of the XOR between $W_i$ and $S$ is also a random graph, and its edge generation is independent of that in $G^{W_l}$, $l \neq i$. Furthermore, it is easy to show that our design applies XOR operations on $\binom{k}{2} - (k - 1)$ node pairs on the $k$ nodes, and each node pair has an edge with a probability of $1/2$. Thus, the probability of a subgraph $G^{W_l}[Y]$ being isomorphic to $S^{W_i}$ is $P(\mathcal{E}_Y) = \left(\frac{1}{2}\right)^{\binom{k}{2} - (k - 1)} \cdot \beta$, where $\beta < 1$ is the probability that every $(y_i, y_{i+1})$ pair in $G^{W_l}[Y]$ is connected. Thus $P(\mathcal{E}_Y) < \left(\frac{1}{2}\right)^{\binom{k}{2} - (k - 1)}$.

Since $\mathcal{E} = \bigcup_Y \mathcal{E}_Y$ and there are fewer than $n^k$ possible sets of $k$ ordered nodes in $G^{W_l}$, we use the Union Bound to compute the probability of event $\mathcal{E}$ as follows:

$$P(\mathcal{E}) < n^k \cdot P(\mathcal{E}_Y) < n^k \cdot \frac{1}{2^{\binom{k}{2} - (k - 1)}} \le 2^{\frac{k^2}{2 + \delta}} \cdot \frac{1}{2^{\frac{k^2 - 3k}{2} + 1}} = \frac{1}{2^{\frac{\delta k^2}{2(2 + \delta)} - \frac{3k}{2} + 1}} \qquad (2)$$

Here the third step uses $n \le 2^{k/(2 + \delta)}$, which follows from the condition on $k$, together with $\binom{k}{2} - (k - 1) = \frac{k^2 - 3k}{2} + 1$. The above equation shows that the probability $P(\mathcal{E})$ reduces exponentially to 0 as $k$ increases.

Finally, we can apply the same method to show that with low probability, any subgraph of $G$ is isomorphic to $S^{W_i}$. This is because the XOR operations between $W_i$ and $S$ produce a random graph that is independent of $G$. This concludes our proof. □
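To make the bound concrete, the following Python sketch computes a watermark size $k$ for a given graph size $n$ and evaluates the exponent of the bound in Equation (2). The helper names are hypothetical, and rounding up to the next integer is an assumption; it does reproduce the $k$ values listed in Table 1 for $\delta = 0.3$.

```python
import math


def watermark_size(n: int, delta: float) -> int:
    """Smallest integer k with k >= (2 + delta) * log2(n) (assumed rounding)."""
    return math.ceil((2 + delta) * math.log2(n))


def log2_collision_bound(k: int, delta: float) -> float:
    """log2 of the upper bound on P(E) from Equation (2)."""
    return -(delta * k * k / (2 * (2 + delta)) - 1.5 * k + 1)


if __name__ == "__main__":
    for name, n in [("Oregon (1)", 11_174), ("Russia", 97_134), ("Livejournal", 5_204_176)]:
        k = watermark_size(n, 0.3)
        print(name, k, 2.0 ** log2_collision_bound(k, 0.3))
```

Even for the smallest graph considered (Oregon (1), $k = 31$), the bound stays below $10^{-5}$, in line with the 99.999% uniqueness target used later in this section.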


5.2 Watermark Detectability 

In addition to providing uniqueness, a practical watermark design should also offer low detectability, i.e. each watermark should be identified by external users or attackers only with low probability. This means that without knowing the seed $\Omega_i$ associated with user $i$, the embedded watermark graph $S^{W_i}$ should not be easily distinguishable from the rest of the graph $G^{W_i}$. Therefore, the detectability depends heavily on the topology of the original graph $G$, i.e. a watermark graph can be well hidden inside a graph $G^{W_i}$ if its structural properties are not too different from those of $G$.

In the following, we examine the detectability of watermarks in terms of a graph's suitability for watermarking. This is because directly quantifying the detectability is not only highly computationally expensive¹ but also lacks a proper metric. Instead, we cross-compare the key structural properties of $S^{W_i}$ and $G$, and define $G$ as being suitable for watermarking if its structural properties are similar to those of $S^{W_i}$, implying a low watermark detectability.

Suitability for Watermarking. To evaluate a graph's suitability for watermarks, we first study the key structural properties of the embedded watermark graph $S^{W_i}$. To guarantee watermark uniqueness and minimize distortion, the watermark graph $S^{W_i}$ needs to be a random graph with an edge probability of $1/2$ (except for the fixed edges between $(x_i, x_{i+1})$ node pairs), and include $k = (2 + \delta) \log_2 n$ nodes. Thus its average node degree is at least $(k + 1)/2$ and its average graph density is $\left(\binom{k}{2} + k - 1\right)/2$.


¹ Each embedded watermark graph is similar to a random graph with $1/2$ edge probability. Thus the detectability is low if certain subgraphs of $G$ are also random graphs with similar edge probabilities. Yet identifying these subgraphs (and the embedded watermark graph) on a large graph incurs significant computation overhead.


Table 2: Size and density of the subgraph on nodes with degree > $(k+1)/2$ in each graph. Size is the number of subgraph nodes, and density is quantified as the average number of edges each node has inside the subgraph.

| Graph | Subgraph: Node # | Subgraph: Avg. Deg. | Watermark Graph: k | Watermark Graph: Avg. Deg. | Suitability |
|---|---|---|---|---|---|
| Russia | 4,794 | 22.2 | 39 | 20.0 | Yes |
| L.A. | 196,174 | 49.2 | 45 | 23.0 | Yes |
| London | 562,075 | 56.1 | 48 | 24.5 | Yes |
| Epinions (1) | 7,083 | 68.7 | 38 | 19.5 | Yes |
| Slashdot (08/11/06) | 9,908 | 53.4 | 38 | 19.5 | Yes |
| Twitter | 34,014 | 60.5 | 38 | 19.5 | Yes |
| Slashdot (09/02/16) | 10,065 | 53.0 | 38 | 19.5 | Yes |
| Slashdot (09/02/21) | 10,105 | 53.2 | 38 | 19.5 | Yes |
| Slashdot (09/02/22) | 10,605 | 53.4 | 38 | 19.5 | Yes |
| GPlus | 68,828 | 347.1 | 39 | 20.0 | Yes |
| Epinions (2) | 10,363 | 83.5 | 40 | 20.5 | Yes |
| Youtube | 31,720 | 45.1 | 47 | 24.0 | Yes |
| Pokec | 564,001 | 53.0 | 48 | 24.5 | Yes |
| Flickr | 136,202 | 174.5 | 48 | 24.5 | Yes |
| Livejournal | 945,567 | 57.5 | 52 | 26.5 | Yes |
| Patents | 2,370 | 15.6 | 34 | 17.5 | Yes |
| ArXiv (Theo. Cit.) | 12,054 | 43.4 | 34 | 17.5 | Yes |
| ArXiv (Phy. Cit.) | 14,785 | 37.9 | 35 | 18.0 | Yes |
| ArXiv (Phy.) | 2,860 | 62.5 | 32 | 16.5 | Yes |
| ArXiv (Astro) | 6,536 | 42.9 | 33 | 17.0 | Yes |
| DBLP | 15,004 | 17.3 | 43 | 22.0 | Yes |
| ArXiv (Condense) | 178,455 | 16.0 | 51 | 26.0 | Yes |
| Email (Enron) | 3,481 | 48.2 | 35 | 18.0 | Yes |
| Email (Europe) | 1,779 | 44.0 | 42 | 21.5 | Yes |
| Wiki Talk | 21,253 | 83.1 | 49 | 25.0 | Yes |
| Stanford | 35,600 | 42.1 | 42 | 21.5 | Yes |
| NotreDame | 16,831 | 38.7 | 43 | 22.0 | Yes |
| BerkStan | 110,202 | 57.0 | 45 | 23.0 | Yes |
| Google | 55,431 | 14.8 | 46 | 23.5 | Yes |
| Brightkite | 4,586 | 30.8 | 37 | 19.0 | Yes |
| Gowalla | 17,946 | 39.3 | 41 | 21.0 | Yes |
| Oregon (1) | 264 | 17.1 | 31 | 16.0 | Yes |
| Oregon (2) | 579 | 31.0 | 32 | 16.5 | Yes |
| CAIDA | 575 | 16.0 | 34 | 17.5 | Yes |
| Skitter | 146,601 | 50.0 | 48 | 24.5 | Yes |
| Gnutella (02/08/04) | 796 | 5.2 | 31 | 16.0 | No |
| Gnutella (02/08/25) | 499 | 2.0 | 34 | 17.5 | No |
| Gnutella (02/08/24) | 709 | 2.7 | 34 | 17.5 | No |
| Gnutella (02/08/30) | 1,001 | 3.8 | 35 | 18.0 | No |
| Gnutella (02/08/31) | 1,276 | 3.6 | 37 | 19.0 | No |
| Amazon (03/03/02) | 3,727 | 2.8 | 42 | 21.5 | No |
| Amazon (2012) | 5,318 | 2.5 | 43 | 22.0 | No |
| Amazon (03/03/12) | 25,717 | 6.7 | 43 | 22.0 | No |
| Amazon (03/06/01) | 28,081 | 7.3 | 43 | 22.0 | No |
| Amazon (03/05/05) | 28,044 | 7.5 | 43 | 22.0 | No |
| Pennsylvania | 0 | 0 | 47 | 24.0 | No |
| Texas | 0 | 0 | 47 | 24.0 | No |
| California | 0 | 0 | 49 | 25.0 | No |


Given these properties of the embedded watermark, we note that watermark node degree and density can be higher than those of many real-world graphs, such as those listed in Table 1. Intuitively, to ensure low detectability of such a watermark graph, suitable graphs should include a set of nodes ($D$) which are difficult to distinguish from the watermark nodes in terms of node degree and subgraph density. Specifically, a suitable graph dataset needs to contain a set of nodes $D$ whose degree is comparable to or higher than the watermark graph node degree, and the density of the subgraph on $D$ must be at least comparable to the watermark graph density. If these two properties hold, the embedded watermark graph cannot be easily distinguished from $D$ in the graph, and is therefore hard for attackers to detect.

To capture the above intuition, we define that a graph $G$ is suitable for watermarking if its node degree and graph density satisfy the following two criteria. First, the minimum and maximum node degree of $G$, denoted as $N_{min}(G)$ and $N_{max}(G)$ respectively, need to satisfy $N_{min}(G) < (k + 1)/2 < N_{max}(G)$. Second, across all $k$-node subgraphs of $G$ whose node degree expectation is greater than $(k + 1)/2$, the minimum and maximum graph density need to satisfy $D_{min}(k) < \left(\binom{k}{2} + k - 1\right)/2 < D_{max}(k)$. Together, these two criteria ensure that the embedded watermark graph can be "well hidden" inside $G^{W_i}$.
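The two criteria translate directly into a small check. The Python sketch below is illustrative only; the function names are hypothetical, and the inputs are the measured ranges $[N_{min}(G), N_{max}(G)]$ and $[D_{min}(k), D_{max}(k)]$ as reported in Table 1.

```python
from math import comb


def watermark_avg_degree(k: int) -> float:
    """Lower bound on the average node degree of the watermark graph."""
    return (k + 1) / 2


def watermark_density(k: int) -> float:
    """Expected number of edges in the k-node watermark graph."""
    return (comb(k, 2) + k - 1) / 2


def is_suitable(k: int, n_min: float, n_max: float, d_min: float, d_max: float) -> bool:
    """Apply the node degree criterion and the subgraph density criterion."""
    degree_ok = n_min < watermark_avg_degree(k) < n_max
    density_ok = d_min < watermark_density(k) < d_max
    return degree_ok and density_ok


if __name__ == "__main__":
    # Rows taken from Table 1.
    print(is_suitable(39, 1, 748, 45, 701))  # Russia
    print(is_suitable(31, 1, 103, 30, 80))   # Gnutella (02/08/04)
    print(is_suitable(49, 1, 12, 0, 0))      # California road network
```

For example, Gnutella (02/08/04) passes the node degree criterion but fails the density criterion, because the watermark density of 247.5 lies far above its densest sampled subgraph.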

To compute $D_{min}(k)$ and $D_{max}(k)$ exactly, we would need to enumerate all possible subgraphs of $G$, which is computationally prohibitive for large graphs. Thus we apply a sampling method to estimate them. To estimate $D_{max}(k)$, we identify the subgraph with the highest density using a greedy search: starting from a randomly chosen node $v_1$ with degree > $(k + 1)/2$, we pick the 2nd node $v_2$ with degree > $(k + 1)/2$ that is connected to $v_1$, then the 3rd node $v_3$ with degree > $(k + 1)/2$ that has the largest number of edges to $v_1$ and $v_2$. This greedy search stops once we have found $k$ nodes. We repeat the same process for all the nodes with degree > $(k + 1)/2$, creating multiple subgraphs from which we calculate the density and pick the highest one. To estimate $D_{min}(k)$, we apply a similar process to locate multiple subgraphs, except that for each subgraph we choose the next node $v_{i+1}$ randomly, as long as its node degree is > $(k + 1)/2$ and it connects to at least one of the existing nodes $\{v_1, \ldots, v_i\}$.
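The sampling procedure above can be sketched in Python as follows. This is a rough illustration, not the authors' code: the adjacency-dict representation (node to set of neighbours), the function names, and the convention of returning `(0, 0)` when no qualifying subgraph exists are assumptions. Density is taken to be the number of edges inside the $k$-node subgraph.

```python
import random


def induced_edge_count(adj, nodes):
    """Number of edges of the subgraph induced by `nodes`."""
    node_set = set(nodes)
    return sum(len(adj[u] & node_set) for u in node_set) // 2


def grow_subgraph(adj, start, k, min_degree, rng=None):
    """Grow a connected k-node subgraph from `start` over high-degree nodes.

    With rng=None the next node is the one with the most edges into the
    current subgraph (greedy, for D_max); otherwise it is chosen uniformly
    at random among eligible neighbours (for D_min).
    """
    eligible = {u for u in adj if len(adj[u]) > min_degree}
    chosen, chosen_set = [start], {start}
    while len(chosen) < k:
        frontier = {
            v for u in chosen for v in adj[u]
            if v in eligible and v not in chosen_set
        }
        if not frontier:
            return None  # cannot reach k nodes from this start
        if rng is None:
            nxt = max(sorted(frontier), key=lambda v: len(adj[v] & chosen_set))
        else:
            nxt = rng.choice(sorted(frontier))
        chosen.append(nxt)
        chosen_set.add(nxt)
    return chosen


def estimate_density_range(adj, k, min_degree, seed=0):
    """Estimate (D_min(k), D_max(k)) by sampling from every eligible node."""
    rng = random.Random(seed)
    dense, sparse = [], []
    for start in (u for u in adj if len(adj[u]) > min_degree):
        greedy = grow_subgraph(adj, start, k, min_degree)
        rand = grow_subgraph(adj, start, k, min_degree, rng)
        if greedy is not None:
            dense.append(induced_edge_count(adj, greedy))
        if rand is not None:
            sparse.append(induced_edge_count(adj, rand))
    if not dense or not sparse:
        return (0, 0)
    return (min(sparse), max(dense))
```

In practice `min_degree` would be set to $(k + 1)/2$. Graphs with no nodes above that threshold, such as the road networks in Table 1, yield no subgraphs at all under this sketch.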

Suitability of Real Graph Datasets. We wanted to understand how restrictive our suitability constraints were in the context of real graph datasets available today. We consider 48 real network graphs ranging from 10K nodes and 39K edges to 5M nodes and 48M edges. These graphs represent vastly different types of networks and a wide range of structural topologies. They include 3 social graphs generated from Facebook regional networks matching Russia, L.A., and London. They include 12 other graphs from online social networks, including Twitter, Youtube, Google+, Slovakia Pokec [39], Flickr [26], Livejournal [26], 2 snapshots from Epinions [33], and 4 snapshots from Slashdot. We also add 3 citation graphs from arXiv and U.S. Patents, 4 graphs capturing collaborations in arXiv and DBLP [44], 3 communication graphs generated from 2 Email networks and Wiki Talk, 4 web graphs, 2 location-based online social graphs from Brightkite and Gowalla, 5 snapshots of the P2P file sharing graph from Gnutella, 4 Internet Autonomous System (AS) maps, 5 snapshots of Amazon co-purchasing networks, and 3 U.S. road graphs. The statistics of all graphs are listed in Table 1.

For all graphs, we use $\delta = 0.3$ to ensure a 99.999% watermark uniqueness, and compute and list the corresponding value of $k$ (from the condition on $k$ in Theorem 1) in Table 1. Next we list the two criteria in terms of $(k + 1)/2$ vs. $[N_{min}(G), N_{max}(G)]$, and $\left(\binom{k}{2} + k - 1\right)/2$ vs. $[D_{min}(k), D_{max}(k)]$. If a graph satisfies both criteria, our analytical results will hold for any watermarks embedded on it.

We can make two observations based on the results in Table 1. First, 35 out of our 48 total graphs are suitable for watermarking. Also note that graphs describing similar networks are consistent in their suitability. For example, all 15 graphs from various online social networks are suitable for watermarks. Second, all 13 graphs unsuitable for watermarks come from only 3 kinds of networks, i.e. Amazon co-purchasing networks, P2P networks, and road networks. The results within each group are self-consistent. These results support our assertion that our proposed watermarking mechanism is applicable to most of today's network graphs with low detection risk. In practice, the owner of a graph can apply the same mechanism to determine if her graph is suitable for our watermark scheme.

To understand the key properties determining whether a graph is suitable for watermarking, we measure various graph structural properties, including average node degree, node degree distribution, clustering coefficient, average path length, and assortativity. We also consider the size and density of subgraphs on nodes with degree more than the watermark minimum average degree $(k + 1)/2$. Our measurement results show that the size and density of subgraphs on nodes with degree > $(k + 1)/2$ are the most important properties for determining suitability. Here, the size of these subgraphs is the number of nodes in the subgraph, and the density of the subgraph is measured as the average number of edges each node has inside the subgraph, i.e. the average degree inside the subgraph. As shown in Table 2, unsuitable graphs do not have subgraphs with density comparable to watermarks, while subgraphs with the desired density can be found in graphs deemed suitable. These results are consistent with our intuition on quantifying suitability for watermarks.

Summary. Since the average watermark subgraph has high node degree and density, a graph suitable for watermarking must include a set of nodes whose degree and subgraph density are comparable to or even higher than those of watermark subgraphs. We propose two criteria, targeting node degree and subgraph density respectively, to quantify whether a graph is suitable for watermarking. We collect a large set of graph datasets available today, and find that 35 out of 48 real graphs are suitable for watermarking. This promising result indicates that the watermark technique can be applied to most real networks with a low probability of the watermark being identified.

6. MORE ROBUST WATERMARKS 

Our basic design provides the fundamental building blocks of graph watermarking with little consideration of external attacks. In practice, however, malicious users can seek to detect or destroy watermarked graphs. Here, we first describe external attacks on watermarks, and then present advanced features that defend against the attacks. Note that these improvement techniques aim to increase the cost of attacks rather than disabling them completely. Finally, we re-evaluate the watermark uniqueness of the advanced design.

6.1 Attacks on Watermarks 

As discussed earlier, our attack model includes attacks trying to destroy watermarks while preserving the topology of the original graph. Based on the number of attackers, attacks on watermarks fall under our two attack models: single attacker and multiple colluding attackers. With access to only one watermarked graph, a single attacker can modify nodes and/or edges in the graph to destroy watermarks. With multiple watermarked graphs, colluding attackers can perform more sophisticated attacks by cross-comparing these graphs to detect or remove watermarks.

Single Attacker Model. The naive edge attack is easiest 
to launch, and tries to disrupt the watermark by randomly 
adding or removing edges on the watermarked graph. For 
the attacker, there is a clear tradeoff between the severity 
of the attack (number of edges or nodes modified), and the 
structural change or distortion applied to the graph structure. 

At first glance, this attack seems weak and unlikely to be a real threat. The probability of the attacker modifying one edge or node in the embedded watermark graph $W_i$ is extremely low, given the relatively small size of $W_i$ compared to the graph. As shown later, however, this attack can be quite disruptive in practice. By modifying a node $n_i$ or an edge connected to $n_i$, the attack impacts all of $n_i$'s neighboring nodes, since their NSD labels will be modified. These NSD label changes, while small, are enough to make locating nodes in the watermark graph very difficult. This effect is exacerbated in social graphs that exhibit a small world structure, since any change to a supernode's degree will impact a disproportionately large portion of nodes in the graph.

One extreme of this attack is to leak partial watermarked graphs or merge several graphs together. With high probability, this destroys the embedded watermarks, but it also significantly distorts the graph topologies and reduces their usability. Thus, we do not consider such scenarios in our study.

Collusion Attacks. By obtaining multiple watermarked graphs, an attacker can compare these graphs to eliminate watermarks. Since we anonymize each watermarked graph by randomly reassigning node IDs (see Section 4.1), attackers cannot directly match individual nodes across graphs. To compare multiple graphs, we apply the deanonymization methods proposed in [27, 28]. Specifically, we first match the 1000 highest degree nodes between two graphs based on their degree and neighborhood connectivities [28], and then start from these nodes to find new mappings using the network structure and the previously mapped nodes [27].

Using the deanonymization method, attackers can then build a "cleaned" graph, where an edge exists if it exists in the majority of the watermarked graphs. Since embedded watermark graphs are likely embedded at different locations on each graph, a majority vote approach effectively removes the contributions from watermark subgraphs, leading to a graph that closely approximates the original $G$.
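The majority vote step of this attack is simple to express in code. The Python sketch below assumes the hard part, aligning node IDs across the watermarked graphs via deanonymization, has already been done, so that each graph is just a set of `frozenset` edges over shared node labels; the function name is hypothetical.

```python
from collections import Counter


def majority_vote_graph(edge_sets):
    """Keep an edge only if it appears in more than half of the graphs.

    edge_sets: list of sets of frozenset({u, v}) edges, one per
    watermarked graph, with node IDs already aligned (assumption).
    """
    counts = Counter(edge for edges in edge_sets for edge in edges)
    needed = len(edge_sets) / 2
    return {edge for edge, count in counts.items() if count > needed}
```

An edge added or removed by a single user's watermark is outvoted by the other graphs, which is why watermarks embedded at different locations do not survive this procedure.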

6.2 Improving Robustness against Attacks 

The attacks discussed above can disrupt the watermark extraction process in two ways. First, adding or deleting nodes/edges in $G'$ changes node degrees, and therefore nodes' NSD labels, thereby disrupting the identification of candidate nodes during the second step of the extraction process; second, adding or deleting nodes/edges inside the embedded watermark graph $S^{W_i}$ can change the structure of the watermark graph, making it difficult to identify during the third step of the extraction process. To defend against these attacks, we must make the watermark extraction process more robust against attack-induced artifacts on both node and graph structure. To do so, we propose four improvements over the basic extraction design in Section 4.2.

Improvements #1, #2: Addressing changes to node neighborhoods. Extracting a watermark involves searching through nodes in $G'$ by their NSD labels. By adding or deleting nodes/edges, attackers can effectively change NSD labels across the graph. To address this, we propose two changes to the basic extraction design. First, we bucketize node degrees (with bucket size $B$) to reduce the sensitivity of a node's NSD label to its neighbors' node degrees. For example, with $B = 5$, a node with degree 9 will stay in the same bucket even if one of its edges has been removed (reducing its node degree to 8). Second, when selecting a watermark node's candidate node list, we replace the exact NSD label matching with approximate NSD label matching. That is, a match is found if the overlap between two bucketized NSD labels exceeds a threshold $\theta$. For example, with $\theta = 50\%$, a node with bucketized NSD label "1-2-3-4" would match a node with label "1-2-3" since the overlap is 75% > $\theta$.
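A possible reading of these two changes is sketched below in Python. Two points are assumptions made for illustration, since this section does not spell them out: the NSD label of a node is modelled as the sorted list of its neighbours' degrees, and the overlap of two labels is the size of their multiset intersection divided by the length of the longer label (which gives 75% for the "1-2-3-4" vs. "1-2-3" example). Function names are hypothetical.

```python
from collections import Counter


def bucketized_nsd_label(adj, node, bucket_size):
    """Sorted neighbour degrees of `node`, each mapped to its degree bucket.

    adj maps each node to the set of its neighbours.
    """
    return sorted(len(adj[v]) // bucket_size for v in adj[node])


def label_overlap(label_a, label_b):
    """Fraction of shared entries, relative to the longer label (assumed metric)."""
    if not label_a and not label_b:
        return 1.0
    common = sum((Counter(label_a) & Counter(label_b)).values())
    return common / max(len(label_a), len(label_b))


def labels_match(label_a, label_b, theta):
    """Approximate match: the overlap must exceed the threshold theta."""
    return label_overlap(label_a, label_b) > theta
```

With `bucket_size = 5`, degrees 8 and 9 both map to bucket 1, so removing one edge from a degree-9 neighbour leaves the label unchanged.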

These changes clearly allow us to identify more candidate nodes for each watermark node, thus improving robustness against small local modifications. On the other hand, more candidate nodes lead to more computation during the subgraph matching step, i.e. step 3 in the extraction process. Such expansion, however, does not affect watermark uniqueness and detectability, since they are unrelated to the size of candidate pools.

Improvements #3, #4: Addressing changes to subgraph structure. Random changes made to $G'$ by an attacker have some chance of directly impacting a node or edge in the embedded watermark. To address this, we propose two techniques. First, we add redundancy to watermarks by embedding the same watermark graph $W_i$ into $m$ disjoint subgraphs $S_1, S_2, \ldots, S_m$ from the original graph $G$. This greatly increases the probability of the owner locating at least one unmodified copy of $W_i$ during extraction, even in the presence of attacks that make significant changes to nodes and edges in $G'$. Note that since we embed watermarks on disjoint subgraphs, this does not affect the uniqueness $1 - P(\mathcal{E})$ of each individual watermark. Embedding $m$ watermarks does, however, affect the false positive rate, which becomes $1 - (1 - P(\mathcal{E}))^m$.

Second, it is still possible that all the watermark graphs are "destroyed" by the attacker and there are no matches in the extraction process. If this happens, we replace the exact subgraph matching in step 3 of the extraction process with approximate subgraph matching. That is, a subgraph matches the watermark graph if the edge difference between the two is less than a threshold $L$. By relaxing the search criteria used in step 3 of the extraction process, this technique allows us to identify "partially" damaged watermarks, thus again improving robustness against attacks. However, it can also increase false positives in watermark extraction, reducing watermark uniqueness. We show later in this section that the impact on watermark uniqueness can be tightly bounded by controlling $L$.
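As a small illustration of the relaxed criterion, the Python sketch below counts the edge difference between a candidate subgraph and the watermark graph. It assumes both are given as sets of `frozenset` edges and that the candidate's nodes have already been relabelled to the corresponding watermark nodes $x_1, \ldots, x_k$; the function names are hypothetical.

```python
def edge_difference(candidate_edges, watermark_edges):
    """Number of node pairs on which the two subgraphs disagree."""
    return len(candidate_edges ^ watermark_edges)


def approx_match(candidate_edges, watermark_edges, max_diff):
    """Approximate match: the edge difference must be less than L (max_diff)."""
    return edge_difference(candidate_edges, watermark_edges) < max_diff
```

Setting `max_diff = 1` recovers exact matching, while larger values tolerate a few attacker-induced edge flips at the cost of more false positives.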

Improvement #5: Addressing Collusion Attacks. Recall that powerful attackers able to match graphs at an individual node level can leverage majority votes across multiple watermarked graphs to remove watermarks. To defend against this, our insight is to embed watermarks that have some portion of spatial overlap in the graph, such that those components will survive majority votes over graphs.

We propose a hierarchical watermark embedding process to protect watermark(s) against collusion attacks. To build watermarked graphs for $M$ users, we uniform-randomly divide these $M$ users into 2 groups ($a_1$ and $a_2$) and associate each group with a public-private key pair, $\langle K^{a_1}_{pub}, K^{a_1}_{priv} \rangle$ or $\langle K^{a_2}_{pub}, K^{a_2}_{priv} \rangle$, which is generated and held by the data owner. We repeat this to produce another group partition, randomly dividing the $M$ users into 2 groups ($b_1$ and $b_2$) associated with the group key pairs $\langle K^{b_1}_{pub}, K^{b_1}_{priv} \rangle$ and $\langle K^{b_2}_{pub}, K^{b_2}_{priv} \rangle$ respectively. After this step, each user is assigned to two groups. For example, a user $i$ is assigned to groups $a_1$ and $b_2$.

To prevent the data owner or users from forging group assignments, we modify step 1 in Section 4.1 to achieve an agreement on group assignments between the data owner and each user. More specifically, at time $T$ when the data owner intends to share its graph with a user $i$ assigned to two groups, e.g. groups $a_1$ and $b_2$, the data owner first sends user $i$ three items: the current timestamp $T$ and two group signatures $K^{a_1}_{priv}(T)$ and $K^{b_2}_{priv}(T)$. User $i$ then validates the two group signatures using the two group public keys $K^{a_1}_{pub}$ and $K^{b_2}_{pub}$. If the timestamps encrypted using the group private keys are $T$, user $i$ agrees to the group assignment, saves the three items, and sends back its personal signature, i.e. $K^{i}_{priv}(T)$; otherwise, user $i$ rejects the group assignment. Once the data owner receives user $i$'s signature $K^{i}_{priv}(T)$, it validates this timestamp with user $i$'s public key. If it is valid, the data owner generates three seeds for user $i$: $\Omega_i$ by combining $K^{i}_{priv}(T)$ and $K^G$, $\Omega_{a_1}$ by combining $K^{a_1}_{priv}$ and $K^G$, and $\Omega_{b_2}$ by combining $K^{b_2}_{priv}$ and $K^G$, where $K^G$ is the graph key for graph $G$. Through this agreement scheme, neither the data owner nor the users can forge group assignments. Moreover, since the generated seed for each group is unique, we can make sure that only one unique watermark corresponds to each group.

To embed the watermarks for user $i$, we first follow steps 2-4 in Section 4.1 to embed two group watermarks using its two group seeds generated through the above method, i.e. $\Omega_{a_1}$ and $\Omega_{b_2}$ in the example. We then use user $i$'s individual seed, i.e. $\Omega_i$, to embed an individual watermark. When generating the group watermarks, we make sure that 1) the group watermark remains the same for users in the same group; and 2) watermarks corresponding to different groups do not overlap with each other, or with each user's individual watermark graph. Note that because the group and individual watermarks are generated with different seeds, this hierarchical embedding process does not affect watermark uniqueness.

Under this design, a collusion attack can successfully destroy all the watermarks (group or individual) only if the majority of the watermarked graphs come from different user groups. Otherwise, the majority vote on raw edges will preserve the "group watermark." We can compute the success rate of the attack by the following equation, which represents the probability that the majority of the graphs obtained by the attacker come from different user groups:


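$$\lambda(M_a, J) = \left(1 - J \sum_{i=\lceil (M_a+1)/2 \rceil}^{M_a} \binom{M_a}{i} \left(\frac{1}{J}\right)^{i} \left(\frac{J-1}{J}\right)^{M_a-i}\right)^{2}$$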



where $M_a$ is the number of watermarked graphs obtained by
the attacker and J is the number of groups in each group
partition. The above design chose J = 2 because it minimizes
$\lambda(M_a, J)$ for all $M_a$. Furthermore, when $M_a$ is odd,
$\lambda(M_a, 2) = 0$; and when $M_a$ is even, $\lambda(M_a, 2)$
is at most 0.25, attained when $M_a = 2$. Note that in the equation
above the operation $(\cdot)^2$ is due to the fact that we group
the users twice into two different group classes: $a_1, a_2$ and
$b_1, b_2$. If we only perform the group partition once (e.g.
dividing the users into $a_1, a_2$), then $\lambda(2, 2) = 0.5$.
This means that in practice we can further reduce $\lambda$ by
performing multiple rounds of group division (2 in the above
design) and adding more group watermarks.
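
The values quoted above can be checked numerically. The following is a minimal sketch, assuming that each leaked graph falls into one of the J groups independently and uniformly at random, and that a group watermark survives whenever one group holds a strict majority of the leaked graphs. The function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def majority_probability(m_a: int, j: int) -> Fraction:
    """Probability that some group holds a strict majority of m_a leaked graphs.

    At most one group can hold a strict majority, so the per-group
    probabilities simply add up (factor j).
    """
    smallest_majority = m_a // 2 + 1  # equals ceil((m_a + 1) / 2)
    p_in, p_out = Fraction(1, j), Fraction(j - 1, j)
    one_group = sum(
        comb(m_a, i) * p_in**i * p_out ** (m_a - i)
        for i in range(smallest_majority, m_a + 1)
    )
    return j * one_group


def attack_success_rate(m_a: int, j: int = 2, rounds: int = 2) -> Fraction:
    """Probability that no partition round preserves a group watermark."""
    return (1 - majority_probability(m_a, j)) ** rounds


if __name__ == "__main__":
    for m_a in range(2, 6):
        print(m_a, float(attack_success_rate(m_a)))
```

With `rounds=1` the sketch corresponds to a single group partition, and increasing `rounds` models adding further independent partitions.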

Note that group watermarks contain much less informa¬ 
tion than single user watermarks. In fact, the more robust a 
group watermark, the larger granularity (and less precision) 
it will provide. Our proposed solution is to extend the sys¬ 
tem by using additional “dimensions,” e.g. go beyond the 
two dimensions of a and b mentioned above. Combining 
results from multiple dimensions will quickly narrow down 
the set of potential users responsible for the leak. However, 
since a colluding attack requires the involvement of multi¬ 
ple leakers, even identifying a single leaker is insufficient. 
Developing a scheme to reliably detect multiple (ideally all) 
colluding users is a topic for future work.

6.3 Impact on Watermark Uniqueness

To improve the robustness of our watermark system, we
relax the subgraph matching criteria from exact matching to
approximate matching that tolerates at most L differing edges.
Such relaxation does not affect watermark detectability because
it does not change the embedding process. However, it may
affect watermark uniqueness, which we analyze next.

Consider two watermarked graphs $G^{W_i}$ and $G^{W_j}$ that
were independently generated for users i and j following the
three steps defined in Theorem 1. Let $S^{W_i}$ and $S^{W_j}$
represent the embedded watermark graphs in $G^{W_i}$ and
$G^{W_j}$, respectively. To examine the watermark uniqueness,
we seek to compute the probability that a subgraph in $G^{W_j}$
differs from $S^{W_i}$ by at most L edges.

Our analysis follows a structure similar to that of the proof of
Theorem 1. Let $\mathcal{E}_Y$ denote the event where a subgraph
of $G^{W_j}$ built on k nodes $Y = \{y_1, y_2, ..., y_k\}$ differs
from $S^{W_i}$ by at most L edges. Our goal is to calculate the
probability of the event $\mathcal{E} = \bigcup_Y \mathcal{E}_Y$,
which is the union over all combinations of k nodes. To do so, we
first compute the probability of an individual $\mathcal{E}_Y$.

As shown in Theorem 1, the edges between $\binom{k}{2} - (k-1)$
node pairs in $S^{W_i}$ are generated randomly with probability
$\frac{1}{2}$ and are independent of $G^{W_j}$, while the remaining
k-1 edges ($\langle x_l, x_{l+1} \rangle$, $l = 1 ... k-1$) are
fixed. Thus we can show that the probability that a subgraph
$G^{W_j}[Y]$ differs from $S^{W_i}$ by h edges is upper bounded by
$\left(\frac{1}{2}\right)^{e-k+1} \cdot \binom{e}{h}$, where
$e = \binom{k}{2}$. Therefore, we can derive the probability of
$\mathcal{E}_Y$ as $P(\mathcal{E}_Y) \le \left(\frac{1}{2}\right)^{e-k+1} \sum_{h=0}^{L} \binom{e}{h}$.
And consequently, we have

$$P(\mathcal{E}) \le n^{k} \cdot \left(\frac{1}{2}\right)^{e-k+1} \sum_{h=0}^{L} \binom{e}{h}$$

where $e = \binom{k}{2}$, $k = (2 + \delta)\log_2 n$, and n is the
node count of $G^{W_j}$.

Next, given the probability of uniqueness $1 - P(\mathcal{E})$, we
compute the upper bound on L that ensures $1 - P(\mathcal{E}) >
0.99999$ for all the graphs in Table 1 except the Road graphs,
Amazon graphs and P2P network graphs. Again we set
$\delta = 0.3$. The result is listed in Table 3, where the upper
bound on L varies between 0 and 12. In general, the
larger the graph, the higher the upper bound on L.
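
The upper bound on L can be computed directly from the union bound. The sketch below is illustrative only: it assumes the bound has the form $n^k \cdot (1/2)^{e-k+1} \sum_{h=0}^{L} \binom{e}{h}$ with $e = \binom{k}{2}$, and that k is rounded up to the next integer. The function names are hypothetical, and exact integer arithmetic is used to avoid floating-point underflow.

```python
from math import ceil, comb, log2


def watermark_size(n: int, delta: float) -> int:
    """k = (2 + delta) * log2(n), rounded up (the rounding is an assumption)."""
    return ceil((2 + delta) * log2(n))


def max_edge_tolerance(n: int, delta: float = 0.3, inv_target: int = 100_000) -> int:
    """Largest L whose bound on P(E) stays below 1 / inv_target.

    Returns -1 if even exact matching (L = 0) does not meet the target.
    """
    k = watermark_size(n, delta)
    e = comb(k, 2)
    budget = 2 ** (e - k + 1)        # denominator of (1/2)^(e-k+1)
    scale = n**k * inv_target        # n^k divided by the target probability
    best, total = -1, 0
    for edge_diff in range(e + 1):
        total += comb(e, edge_diff)  # running sum of C(e, h) for h <= edge_diff
        if scale * total >= budget:  # bound no longer below the target
            break
        best = edge_diff
    return best


if __name__ == "__main__":
    for name, nodes in [("LA (approx. 603K nodes)", 603_000),
                        ("Livejournal (approx. 5.2M nodes)", 5_200_000)]:
        print(name, watermark_size(nodes, 0.3), max_edge_tolerance(nodes))
```

The node counts in the usage example are the rounded figures quoted in the text, so results for other graphs depend on their exact sizes.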

7. EXPERIMENTAL EVALUATION 

We evaluate the proposed graph watermarking system using
real network graphs. We consider three key performance
metrics: false positive rate, graph distortion and watermark
robustness. Having analytically quantified the watermark
uniqueness in the earlier sections (including Section 6), we
focus on examining graph distortion and watermark robustness
while ensuring a false positive rate below 0.001%. We also study
the computational efficiency of the proposed watermark embedding
and extraction schemes.

Experiment Setup. Given the large number of graph
computations per data point, we focus our experiments on
two of the larger network graphs listed in Table 1, the LA
regional Facebook graph and the Flickr network graph. The
two graphs have very different sizes and graph structures. To
guarantee a false positive rate below 0.001%, we select
$\delta = 0.3$, and the k values for the LA and Flickr graphs are
45 and 48, respectively. For our basic design, we generate 1
watermark per graph. For our advanced design, we set L to 8, the
degree bucket size to 10, and the NSD similarity threshold to
$\theta = 0.75$. For each user, we embed 5 watermarks in its graph,
3 as individual watermarks and 2 as group watermarks. We
chose these settings because they are intuitive and work well
in practice. We leave the optimization of these parameters to
future work.

Table 3: Upper bound of L for the 35 network graphs.

Graph                  L Bound
Oregon (1)             0
Oregon (2)             1
CAIDA                  1
Email (Enron)          1
arXiv (Theo. Cit.)     1
arXiv (Phy. Cit.)      1
arXiv (Phy.)           1
arXiv (Astro)          1
Patent                 2
Slashdot (08/11/06)    3
Twitter                3
Slashdot (09/02/16)    3
Slashdot (09/02/21)    3
Slashdot (09/02/22)    3
Brightkite             3
Russia                 4
Epinions (1)           4
Google+                4
Epinions (2)           5
Stanford               5
Email (Europe)         5
Gowalla                5
BerkStan               6
DBLP                   7
NotreDame              7
L.A.                   8
London                 8
Flickr                 8
Wiki                   8
Google                 8
Skitter                8
Youtube                9
Pokec                  9
arXiv (Condense)       11
Livejournal            12

Table 4: Percentage of modified nodes and edges after
embedding 5 watermarks into a graph and impact on
graph structure (dK-2 Deviation).

Graph                 Nodes (%)    Edges (%)    dK-2 Deviation
Watermarked LA        0.037%       0.033%       0.0008
Watermarked Flickr    0.014%       0.019%       0.0001

In the following, we present our experiment results in 
terms of 1) amount of distortion introduced to the original 
graph due to watermarking, 2) robustness of the watermark 
against attacks, and 3) computational efficiency of our wa¬ 
termarking design. 

7.1 Graph Distortion from Watermarks 

We consider three types of metrics for measuring the 
graph distortion from watermarks. 

• Modifications to the raw graph - We count the number of 
nodes and edges modified by embedding watermarks. In¬ 
tuitively, more modifications to the graph introduce higher 
distortion. 

• Deviation in the dK-2 distribution - We also measure the
Euclidean distance between the dK-2 series of the original
graph and that of the watermarked graph (defined in the
footnote to this subsection). A larger deviation in the dK-2
series implies higher distortion to the graph structure.

• Graph metrics w/ and w/o watermarks - Finally we measure
the commonly used graph metrics before and after the
watermarking, including node degree distribution, assortativity
(AS) [36], clustering coefficient (CC) [36], average
path length and diameter. Any large deviation in any of
these metrics indicates that the watermarked graph experienced
large distortion.

Table 5: Graph metrics are consistent w/ and w/o watermarks.

Graph                 AS       Avg. CC    Avg. Deg    Avg. Path    Dia.
LA Original           0.21     0.19       25.4        4.6          14
LA Watermarked        0.21     0.19       25.4        4.6          14
Flickr Original       -0.02    0.18       18.1        5.3          21
Flickr Watermarked    -0.02    0.18       18.1        5.3          21

We have examined the distortion introduced by both the 
basic and advanced designs. We only show the results of the 
advanced design because it adds more watermarks and thus 
leads to higher distortion. For both LA and Flickr graphs, 
we generate 10 different watermarked graphs (using 10 different
random generator seeds) and present the average result
across these graphs. Because computing average path
length and diameter on these two large graphs is highly
computationally intensive, we randomly sample 1000 nodes and
compute the average path length and diameter among them
(following the same approach taken by prior works on social 
graph analysis El). 

Table 4 shows the percentage of nodes and edges modified
by watermarking. Even after embedding 5 watermarks, the
modification is less than 0.04% for LA and 0.02% for Flickr.
These small changes imply little distortion on the watermarked
graphs. This is further confirmed by the average
dK-2 distances for both graphs, 0.0008 for LA and 0.0001
for Flickr, indicating that the watermarked graphs are highly
similar to the original graphs.

Table 5 compares the original and watermarked graphs
in terms of five representative graph metrics. Similarly, for
both LA and Flickr, the graph metrics remain the same before
and after watermarking. We also examined the statistical
distribution of each metric and found no visible difference
between the graphs.

Together, these results indicate that our proposed wa¬ 
termarking system successfully embeds watermarks into 
graphs with negligible impact on graph structure. This is 
unsurprising, given the extremely small size of watermarks 
relative to the original graphs. Thus we believe watermarked 
graphs can replace the originals in graph applications and 
produce (near-)identical results. 
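
To make the dK-2 deviation concrete, the following minimal sketch computes a dK-2 series (the number of edges joining nodes of degree $d_1$ and $d_2$) and the distance between two such series. It assumes undirected simple graphs given as edge lists, raw edge counts as series entries, and normalization by the number D of degree pairs present in either series; these choices and the function names are our assumptions rather than the exact procedure used in the experiments.

```python
from collections import Counter
from math import sqrt


def dk2_series(edges):
    """Count edges per unordered degree pair (d1, d2) with d1 <= d2."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    series = Counter()
    for u, v in edges:
        d1, d2 = sorted((degree[u], degree[v]))
        series[(d1, d2)] += 1
    return series


def dk2_distance(series_a, series_b):
    """Euclidean distance between two dK-2 series, divided by the entry count D."""
    pairs = set(series_a) | set(series_b)
    if not pairs:
        return 0.0
    squared = sum((series_a[p] - series_b[p]) ** 2 for p in pairs)
    return sqrt(squared) / len(pairs)


if __name__ == "__main__":
    path = [("a", "b"), ("b", "c")]
    triangle = [("a", "b"), ("b", "c"), ("a", "c")]
    print(dk2_distance(dk2_series(path), dk2_series(triangle)))
```

A distance of zero means the two graphs have identical joint degree distributions, which is the sense in which small dK-2 deviations indicate low structural distortion.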

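Footnote: The Euclidean distance between two dK-2 series $G_1$ and $G_2$ is defined
by $\frac{1}{D}\sqrt{\sum_{\langle d_1,d_2 \rangle}\left(e^{G_1}_{\langle d_1,d_2 \rangle} - e^{G_2}_{\langle d_1,d_2 \rangle}\right)^2}$, where D is the number of
$\langle d_1, d_2 \rangle$ combinations or entries in the dK-2 series.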


7.2 Robustness against Attacks 

Next, we investigate how the proposed watermarking system
performs in the presence of attacks. For each of the two
attack implementations discussed in Section 6.1, we vary the
attack strength and examine the robustness against the attack
as well as the cost of the attack. Specifically, we repeat each
experiment 10 times, and examine two metrics:

• Robustness - in the single attacker model, the robustness 
is quantified as the ratio of graphs from which we can suc¬ 
cessfully extract at least one of the 3 individual watermarks. 
In the collusion attack, in addition to this ratio, we also 
measure the ratio of graphs where we can extract at least 
one of the 5 watermarks (3 individual + 2 group water¬ 
marks). 

• Cost of the attack - the normalized distortion produced on 
the attacked graph. It represents the Euclidean distance be¬ 
tween the dK-2 series of the attacked graphs and that of 
the original graph, normalized by the Euclidean distance 
between the dK-2 series of the clean watermarked graphs 
and that of the original graph. If the normalized distortion 
is larger than 1, the attack introduces more distortion than 
embedding the watermarks. 

Results on the Single Attacker Model. For the single at¬ 
tacker model, we quantify the attack strength by the number 
of modified edges. The robustness and the cost of the attack 
are measured as a function of the number of modified edges. 

To show how much the improvement mechanisms help, we
first evaluate the robustness of the basic watermark scheme.
We run the single attacker model on the watermarked graphs
while varying the number of modified edges, and repeat the
experiment 10 times. The robustness here is quantified as the
ratio of graphs from which we can successfully extract the
watermark.

Figure 2 shows the robustness of the basic watermark
scheme against the single attacker model. It shows that randomly
modifying a small number of edges disrupts the watermark
subgraph extraction process. In LA, our basic design can no
longer recover the watermark with 100% probability once as few
as 20 edges are modified. In Flickr, a larger graph,
the robustness of the basic scheme drops below 40%
when only 500 edges are modified. In each case, at least
one of the nodes in the watermark subgraph had a modified
NSD label (the degree of one of its neighbors changed),
and it could not be located in the extraction process. We
also look at the distortion caused by the attack, shown in
Figure 3. As expected, the small number of modified edges
causes small distortions in graph structure. For example,
in LA, when the robustness is 0, the distortion is around 3x
more than that caused by embedding the watermark. Both
results show that watermarked graphs generated by the basic
scheme are easily disrupted by even small, single user
attacks.

Figure 2: The robustness of the basic design against the single attacker model.

Figure 3: The distortion caused by the single attacker model in the basic design.

Figure 4: The robustness in the improved design against the single attacker model.

Figure 5: The distortion caused by the single attacker model in the improved design.

Figure 6: The robustness against the collusion attack.

Figure 7: The distortion caused by the collusion attack.

Figures 4(a)-(b) plot the robustness of watermarked LA
and Flickr graphs generated by the scheme with the improvement
mechanisms against the single attacker model. As
expected, the robustness decreases with the attack strength
since more edges are modified to “destroy” watermarks. For
LA, our system maintains 100% robustness up to 230K modified
edges, which is around 400x stronger than the maximum
attack strength handled in the graph generated by
the basic design. For Flickr, the system can handle attack
strength up to 933K modified edges, which is > 400x
stronger than the maximum attack strength in the basic design.
This is because Flickr is larger in size while having a
similar watermark graph size k, so the attacker must modify
more edges to destroy watermarks. On the other hand, results
in Figure 5 show that the cost of these attacks is large.
For Flickr, with more than 1.4M modified edges, an attack
leads to 800x more distortion than that caused by embedding
the watermarks. Together, these results show that our watermark
system with the improvement mechanisms is highly
robust against single user attacks.

Results on Collusion Attacks. To implement the collusion
attack described in Section 6.1, we first generate 10
watermarked graphs and randomly pick $M_a$ graphs from them
as the graphs acquired by the attacker. We vary the number
of graphs obtained by the attacker, $M_a$, between 2 and 5. For
each $M_a$ value we repeat the experiments 10 times and report
the average value. Since watermarks generated by the
basic design can be easily detected and removed by the powerful
collusion attack, here we focus on evaluating the robustness
of the improvement mechanisms.
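
The core of the collusion attack, a majority vote on raw edges, can be sketched as follows. This is an illustrative sketch only: it assumes the colluders have already aligned node identities across their graphs (the deanonymization step is outside its scope), treats edges as undirected, and keeps an edge only under a strict majority. The function names are hypothetical.

```python
from collections import Counter


def normalize(edge):
    """Represent an undirected edge with its endpoints in sorted order."""
    u, v = edge
    return (u, v) if u <= v else (v, u)


def majority_vote_edges(graphs):
    """Keep the edges that appear in a strict majority of the given edge sets."""
    votes = Counter()
    for edges in graphs:
        votes.update({normalize(e) for e in edges})  # one vote per graph per edge
    threshold = len(graphs) / 2
    return {edge for edge, count in votes.items() if count > threshold}


if __name__ == "__main__":
    base = {(1, 2), (2, 3)}
    group_wm = {(4, 5)}  # shared by two colluders in the same group
    g1 = base | group_wm | {(1, 3)}  # (1, 3) is an individual watermark edge
    g2 = base | group_wm | {(2, 4)}
    g3 = base | {(3, 5)}
    print(sorted(majority_vote_edges([g1, g2, g3])))
```

In the usage example, the individual watermark edges are voted out because each appears in only one graph, while the shared group watermark edge survives because two of the three colluders belong to the same group.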

Figures 6(a)-(b) show the robustness of the watermarked
LA and Flickr graphs against the collusion attack.
Figure 6(a) shows that in LA, by applying majority votes on
raw edges, the collusion attack can effectively remove all 3
individual watermarks. However, the attack is ineffective in
removing both group watermarks, so that we can extract at
least one group watermark in more than 60% of the attacked
graphs. Here the robustness values deviate slightly from
those projected by the equation for $\lambda(M_a, J)$ because we
limit the statistical sampling to 10 runs. Unlike for LA, Figure 6(b)
shows that the collusion attack cannot remove all the individual
watermarks in Flickr when using 2 or 3 watermarked
graphs. This is because the deanonymization method
mismatches a large portion of the nodes in Flickr (30% of
nodes). Finally, Figure 7 shows that the collusion attack also
introduces larger distortions in graph structure. This mainly
comes from the mismatches of the deanonymization method.

These results show that even a powerful collusion attack
is ineffective in removing all the embedded watermarks.
Moreover, the potential inaccuracy of the deanonymization
method makes the collusion attack even weaker in removing
individual watermarks. Of course, the attackers will eventually
succeed in disrupting watermarks if they are willing
to modify larger portions of the graph, thus sacrificing the
utility of the graph. While our work provides a robust defense
against attackers with a relatively low tolerance
for graph distortion, we hope follow-on work will develop
more robust defenses against higher distortion attacks.

7.3 Computational Efficiency 

Here, we measure the efficiency of the watermarking system.
There are two components in the watermarking system,
i.e. watermark embedding and watermark extraction. The
time to extract a watermark is the time to run steps 2 and 3
in watermark extraction, i.e. candidate selection and watermark
identification.

To accelerate the extraction process, we parallelize the key
steps across multiple servers. More specifically, in the candidate
selection step, each available server is assigned an
unchecked watermark node and finds its candidates. In step 3,
each available server is assigned one candidate of watermark
node $x_1$ and searches for the watermark starting from it. When a
watermark is found or no unchecked candidates remain,
the extraction process stops (for that user).

We perform measurements to quantify the actual impact
of parallelizing extraction over a cluster. All system parameters
are the same as in the previous tests, except that we embed 1
watermark into each graph. To extract watermarks, we compare
the improved watermark extraction method to the basic extraction
method, with bucket size 10 and an NSD similarity threshold of
0.75 in the improved extraction method. In addition to the
two graphs, i.e. Flickr and LA, we also measure efficiency
on the largest graph in our study (Livejournal, 5.2 million
nodes, 49 million edges), shown in Table 1. We parallelize
watermark extraction across 10 servers, each a 2.33GHz
Xeon machine with 192GB RAM. All experiments are repeated on
10 different watermarked graphs, and the reported time is the
average of the 10 computation times.

The first result in Table 6 is that the watermarking system is
efficient in embedding and extracting watermarks. On average,
embedding one watermark into a graph is very fast.
For example, the average embedding time for the largest graph,
Livejournal, is around 12 minutes, and the embedding times for
Flickr and LA are less than 2 minutes. Even using one
server to extract watermarks, the computation time is small.
For Flickr, for example, the extraction time is around 13 minutes
using either the basic method or the improved method. From our
observation, the time to identify the watermark graph on the
candidate subgraphs is much less than the time required to
find and filter candidates, which accounts for around 99%
of the total computation time. Since finding candidates takes
$O(kn)$ time and $k = (2 + \delta)\log_2 n$, the
complexity of extracting a watermark from a real-world graph
is $O(n \log_2 n)$. Here k is the number of nodes in the watermark
graph and n is the number of nodes in the total graph.

Table 6: The efficiency of the watermarking system, including
watermark embedding time on one server, the
extraction time on one server and the parallel extraction
time across 10 servers using basic watermark extraction
method and improved watermark extraction method.

               Embedding    Basic Extraction           Improved Extraction
Graph          (s)          Single (s)   Parallel (s)  Single (s)   Parallel (s)
LA             40           270          39            310          42
Flickr         80           767          195           776          197
Livejournal    695          2568         310           2605         317

Second, we find that the speedup from distributed extraction
is quite good, with a speedup of 8 over 10 servers for Livejournal
and 7 for LA (for both extraction methods). The speedup
for Flickr is only around 4 using both methods, because one
of the watermarked graphs takes much longer than the others
to find candidates, ~10 minutes. This is almost 4
times longer than the average extraction time on the other
graphs. Not counting this outlier, the average parallel extraction
time on Flickr is around 150 seconds for both methods,
which is 5 times faster than using one server. This is because
the core computation is finding candidates, and completion
time can vary when computing the NSD similarity between
watermark nodes and graph nodes, which depends
on node degree. The higher the degree, the longer the
similarity computation takes. Since there are several Flickr
watermark nodes of high degree, the time to find candidates is
relatively longer.
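
The speedup figures quoted above follow directly from Table 6. As a small worked example, the sketch below divides the single-server extraction time by the 10-server extraction time for each graph and method. The times are copied from Table 6, while the dictionary layout and function name are ours.

```python
# Extraction times in seconds, copied from Table 6: (single server, 10 servers).
TABLE6 = {
    "LA": {"basic": (270, 39), "improved": (310, 42)},
    "Flickr": {"basic": (767, 195), "improved": (776, 197)},
    "Livejournal": {"basic": (2568, 310), "improved": (2605, 317)},
}


def speedups(table):
    """Return single-server time divided by parallel time per graph and method."""
    return {
        graph: {method: single / parallel
                for method, (single, parallel) in methods.items()}
        for graph, methods in table.items()
    }


if __name__ == "__main__":
    for graph, methods in speedups(TABLE6).items():
        for method, value in methods.items():
            print(f"{graph:12s} {method:9s} speedup = {value:.1f}x")
```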

Finally, comparing the two extraction methods, there is no
significant difference between their computation times. This
is because the extraction time of both methods is dominated
by the time to find and filter candidates, which is
$O(n \log_2 n)$ for both methods.

Summary. We evaluate the efficiency of the graph watermark
embedding and extraction algorithms on three real-world
graphs with 600K to 5M nodes and 7M to 48M
edges. The results show that the embedding process is fast
even for large graphs, and only takes up to 12 minutes to embed
a watermark into a graph with 5M nodes. In the extraction
process, the time to identify watermark graphs on the
set of pre-filtered candidate nodes is much less than the time
to filter candidate nodes, whose complexity is $O(n \log_2 n)$.
Our experimental results also show that on a single commodity
server, the extraction time is at most 43 minutes in a 5M-node
graph, and can be further reduced to about 5 minutes
by distributing the computation across multiple servers.

8. CONCLUSION 

In this paper, we take a first step towards the design 
and implementation of a robust graph watermarking system. 
Graph watermarks have the potential to significantly impact 
the way graphs are shared and tracked. Our work identifies 
the critical requirements of such a system, and provides an
initial design that targets the critical properties of uniqueness,
robustness to attacks, and minimal distortion to the
graph structure. We also identify key attacks against graph 
watermarks, and evaluate them against an improved design 
with additional features for improved robustness under at¬ 
tack. 

Our evaluation shows that our initial watermarking sys¬ 
tem modifies very few nodes and edges in a graph, i.e. less 
than 0.04% nodes and edges in a graph with 603K nodes 
and 7.6M edges. Results also demonstrate extremely low 
distortion, i.e. the watermarked graphs are highly consis¬ 
tent with the original graph in all graph metrics we consid¬ 
ered. Empirical tests on several real, large graphs show that 
our robustness features dramatically improved our resilience 
against both single and multi-user collusion attacks. Finally, 
we show that the embedding process and the extraction pro¬ 
cess are efficient, and the extraction process is easily paral¬ 
lelized over a computing cluster. 

While our proposed scheme achieves many of our initial 
goals, there is significant room for improvement and on¬ 
going work. One focus is developing stronger redundancy 
schemes to protect against attackers with a greater tolerance 
for graph distortion, i.e. willing to make a greater number of 
node/edge changes. Another is to develop alternate schemes 
that can recover more information about multiple attackers 
in the colluding attack model.